Abstract: We study the intrinsic limitations of sequential
convex optimization through the lens of feedback information 
theory. In the oracle model of optimization, an algorithm 
queries an oracle for noisy information about the unknown 
objective function, and the goal is to (approximately) minimize 
every function in a given class using as few queries as possible. 
We show that, in order for a function to be optimized, the 
algorithm must be able to accumulate enough information 
about the objective. This, in turn, puts limits on the speed of 
optimization under specific assumptions on the oracle and the 
type of feedback. Our techniques are akin to the ones used 
in statistical literature to obtain minimax lower bounds on 
the risks of estimation procedures; the notable difference is 
that, unlike in the case of i.i.d. data, a sequential optimization 
algorithm can gather observations in a controlled manner, so 
that the amount of information at each step is allowed to change 
in time. In particular, we show that optimization algorithms 
often obey the law of diminishing returns: the signal-to-noise 
ratio drops as the optimization algorithm approaches the 
optimum. To underscore the generality of the tools, we use our 
approach to derive fundamental lower bounds for a certain 
active learning problem. Overall, the present work connects the 
intuitive notions of "information" in optimization, experimental 
design, estimation, and active learning to the quantitative notion 
of Shannon information. 

Index Terms: Convex optimization, Fano's inequality,
feedback information theory, hypothesis testing with controlled
observations, information-based complexity, information-theoretic
converse, minimax lower bounds, sequential optimization
algorithms, statistical estimation.



I. Introduction 

Many problems arising in such areas as communications
and signal processing, control, machine learning,
economics, and many others require solving mathematical
programs of the form

$$\min \{ f(x) : x \in \mathsf{X} \}, \qquad (1)$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a convex objective function and $\mathsf{X}$ is
a compact, convex subset of $\mathbb{R}^n$. Therefore, it is important to
have a clear understanding of the fundamental limits on the 
efficiency of convex programming methods. 
A systematic study of these fundamental limits was initiated
in the 1970s by Nemirovski and Yudin [1]. In their framework,
an optimization algorithm is a sequential procedure that re- 
peatedly queries a black-box oracle for information about the 
function being optimized, each query depending on the past 
information. The oracle may be deterministic (for example, 
giving the value of the function and its derivatives up to some 
order at any point) or stochastic. This leads to the notion 
of information-based complexity, i.e., the smallest number of 
oracle calls needed to minimize any function in a given class 
to a desired accuracy. The results in [1] are very wide in
scope and cover a variety of convex programming problems
in Banach spaces; finite-dimensional versions are covered in
[2] and [3].

For deterministic oracles, Nemirovski and Yudin derived
lower bounds on the information complexity of convex programming
using a "counterfactual" argument: given any algorithm
that purports to optimize all functions in some class
$\mathcal{F}$ to some degree of accuracy $\varepsilon$ using at most $T$ oracle calls,
one explicitly constructs, for a particular history of queries and
oracle responses, a function in $\mathcal{F}$ which is consistent with this
history, and yet cannot be $\varepsilon$-minimized by the algorithm using
fewer than T oracle calls (see also 0). A similar approach 
was also used for stochastic oracles. 

Proper application of this method of resisting oracles requires
a lot of ingenuity. In particular, the stochastic case
involves fairly contrived noise models, unlikely to be encountered
in practice. In this paper, which expands upon our
preliminary work [4], we will show that the same (and many
other) lower bounds can be derived using a much simpler
information-theoretic technique reminiscent of the way one
proves minimax lower bounds in statistics [5]-[7]. Namely,
we reduce optimization to hypothesis testing with controlled 
observations and then relate the resulting probability of error 
to information complexity using Fano's inequality and a series 
of mutual information bounds. These bounds highlight the 
role of feedback in choosing the next query based on past 
observations. One notable feature of our approach is that it 
does not require constructing particularly "strange" functions 
or noise models. Moreover, we derive a "law of diminishing 
returns" for a wide class of convex optimization schemes, 
which says that the decay of optimization error is offset by 
the decay of the rate at which the algorithm can reduce its 
uncertainty about the objective function. 

The idea of relating optimization to hypothesis testing is 
not new. For instance, Shapiro and Nemirovski [8] derive 
a lower bound on the information complexity of a certain 
class of one-dimensional linear optimization problems by 
reducing optimization to a binary hypothesis testing problem
pertaining to the parameter of a Bernoulli random variable (the 
outcome of a coin toss). The reduction consists in showing 
that any good optimization algorithm can be converted into 
an accurate estimator of the coin bias based on repeated 
independent trials; then one can derive the lower bound on the 
information complexity (equivalently, the minimum necessary 
number of coin tosses) from the data processing inequality 
for divergence (or Fano's inequality). This approach was 
recently extended to multidimensional optimization problems 
by Agarwal et al. [9], [10]. Like the present paper, their work
uses information-theoretic methods to derive lower bounds on
the oracle complexity of convex optimization, and their results
are qualitatively similar to some of ours. However, what sets
our work apart from [8]-[10] is that we explicitly account for
the controlled manner in which the algorithm interacts with the 
oracle. This, in turn, allows us to derive tight lower bounds on 
the rate of error decay for certain types of infinite-step descent 
algorithms, which is not possible with the reduction to coin 
tossing. 

Sequential procedures have become increasingly popular in 
the field of machine learning, mostly due to the abundance 
of data and the resulting need to perform computation online.
Convex optimization is not the only sequential setting
being studied: recent research in machine learning has also 
focused on such scenarios as active learning, multi-armed 
bandits, and experimental design, to name a few. In all these 
settings, one element is common: each additional "action" 
should provide additional "information" about some unknown 
quantity. Translating this intuitive notion of "information" into 
precise information-theoretic statements is often difficult. Our 
contribution consists in offering such a translation for convex 
optimization and closely related problems. 

A. Notation 

Given a continuous function $f : \mathsf{X} \to \mathbb{R}$ on a compact
domain $\mathsf{X} \subset \mathbb{R}^n$, we denote by $f^*$ its minimum value over $\mathsf{X}$:

$$f^* \triangleq \inf_{x \in \mathsf{X}} f(x).$$



We will use several basic notions from nonsmooth convex
analysis [11]. The subdifferential of $f$ at $x$, denoted by $\partial f(x)$,
is the set of all $g \in \mathbb{R}^n$, such that

$$f(y) \ge f(x) + g^T (y - x), \qquad \forall y \in \mathbb{R}^n.$$

Any such $g$ is a subgradient of $f$ at $x$. For a convex $f$, the
subdifferential $\partial f(x)$ is always nonempty. When $|\partial f(x)| = 1$,
its only element is precisely the gradient $\nabla f(x)$. By $\|x\|_p$ we
denote the $\ell_p$ norm of $x \in \mathbb{R}^n$; the $\ell_2$ norm will also be
denoted by $\|\cdot\|$. By $B_p^n$ we denote the unit ball in $\mathbb{R}^n$ in the
$\ell_p$ norm. The $\ell_2$-diameter of $\mathsf{X}$ is defined as

$$D_{\mathsf{X}} = \sup_{x, x' \in \mathsf{X}} \|x - x'\|.$$

The $n \times n$ identity matrix will be denoted by $I_n$.

All abstract spaces are assumed to be standard Borel
(i.e., Borel subsets of a complete separable metric space), and
will be equipped with their Borel $\sigma$-fields. If $\mathsf{Z}$ is such a space,
then $\mathcal{B}_{\mathsf{Z}}$ will denote the corresponding $\sigma$-field. All functions
between such spaces are assumed to be measurable. If $\mathsf{Z}_1$ and
$\mathsf{Z}_2$ are two such spaces, then a Markov kernel [12], [13] from
$\mathsf{Z}_1$ to $\mathsf{Z}_2$ is a mapping $P : \mathcal{B}_{\mathsf{Z}_2} \times \mathsf{Z}_1 \to [0, 1]$, such that for any
$z_1 \in \mathsf{Z}_1$, $P(\cdot|z_1)$ is a probability measure on $(\mathsf{Z}_2, \mathcal{B}_{\mathsf{Z}_2})$ and for
any $B \in \mathcal{B}_{\mathsf{Z}_2}$, $P(B|\cdot)$ is a measurable function on $\mathsf{Z}_1$. We will
use the standard notation $P(dz_2|z_1)$ for such a kernel.

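We will work with the usual information-theoretic quantities,
which are well-defined in standard Borel spaces [14].
Given two (Borel) probability measures $\mathbb{P}$ and $\mathbb{Q}$ on $\mathsf{Z}$, their
divergence is

$$D(\mathbb{P} \| \mathbb{Q}) = \begin{cases} \displaystyle\int_{\mathsf{Z}} d\mathbb{P} \log \frac{d\mathbb{P}}{d\mathbb{Q}}, & \text{if } \mathbb{P} \ll \mathbb{Q} \\ +\infty, & \text{otherwise} \end{cases}$$

where the notation $\mathbb{P} \ll \mathbb{Q}$ means that $\mathbb{P}$ is absolutely
continuous w.r.t. $\mathbb{Q}$, i.e., $\mathbb{Q}(B) = 0$ for any $B \in \mathcal{B}_{\mathsf{Z}}$
implies that $\mathbb{P}(B) = 0$ as well. If $\mathsf{Z}$ is a product space,
$\mathsf{Z} = \mathsf{Z}_1 \times \mathsf{Z}_2$, then the conditional divergence between two
probability distributions $\mathbb{P}$ and $\mathbb{Q}$ on $\mathsf{Z}$ given $\mathbb{P}_{Z_1}$ (the
$\mathsf{Z}_1$-marginal of $\mathbb{P}$) is

$$D(\mathbb{P}_{Z_2|Z_1} \| \mathbb{Q}_{Z_2|Z_1} | \mathbb{P}_{Z_1}) = \int_{\mathsf{Z}_1} \mathbb{P}_{Z_1}(dz_1)\, D\big(\mathbb{P}_{Z_2|Z_1}(\cdot|z_1) \,\big\|\, \mathbb{Q}_{Z_2|Z_1}(\cdot|z_1)\big), \qquad (2)$$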

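where $\mathbb{P}_{Z_2|Z_1}$ and $\mathbb{Q}_{Z_2|Z_1}$ are any versions of the regular
conditional probability distributions of $Z_2$ given $Z_1$ under $\mathbb{P}$
and $\mathbb{Q}$, respectively. This definition extends in the obvious
way to situations when $\mathsf{Z}_1$ or $\mathsf{Z}_2$ are themselves product
spaces. Thus, if $\mathbb{P}$ and $\mathbb{Q}$ are two probability distributions
for a random triple $(Z_1, Z_2, Z_3)$ taking values in a product
space $\mathsf{Z} = \mathsf{Z}_1 \times \mathsf{Z}_2 \times \mathsf{Z}_3$, such that, under $\mathbb{Q}$, $Z_2$ and $Z_3$ are
conditionally independent given $Z_1$, i.e., $\mathbb{Q}_{Z_3|Z_1,Z_2} = \mathbb{Q}_{Z_3|Z_1}$
$\mathbb{Q}$-a.s., then we will write

$$D(\mathbb{P}_{Z_3|Z_1,Z_2} \| \mathbb{Q}_{Z_3|Z_1,Z_2} | \mathbb{P}_{Z_1,Z_2}) = D(\mathbb{P}_{Z_3|Z_1,Z_2} \| \mathbb{Q}_{Z_3|Z_1} | \mathbb{P}_{Z_1,Z_2}). \qquad (3)$$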
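For distributions on finite alphabets, the divergence and the conditional divergence reduce to finite sums. The Python sketch below illustrates that special case only; the function names `divergence` and `conditional_divergence` and the numerical values are illustrative assumptions, not part of the original development.

```python
import math

def divergence(p, q):
    """D(P||Q) in nats for two pmfs on the same finite alphabet."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # P is not absolutely continuous w.r.t. Q
            total += pi * math.log(pi / qi)
    return total

def conditional_divergence(p_z1, p_cond, q_cond):
    """Finite-alphabet conditional divergence: the average over z1 ~ P_{Z1}
    of D(P_{Z2|Z1}(.|z1) || Q_{Z2|Z1}(.|z1)). Rows are indexed by z1."""
    return sum(w * divergence(p_row, q_row)
               for w, p_row, q_row in zip(p_z1, p_cond, q_cond)
               if w > 0.0)

# Illustrative numbers (assumed, not taken from the paper).
p_z1 = [0.5, 0.5]
p_cond = [[0.9, 0.1], [0.2, 0.8]]
q_cond = [[0.5, 0.5], [0.5, 0.5]]
cond_div = conditional_divergence(p_z1, p_cond, q_cond)
```

When the two joint distributions share the same marginal on the first coordinate, as in this example, `cond_div` coincides with the divergence between the joints (the chain rule for divergence).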



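Given a random couple $(Z_1, Z_2) \in \mathsf{Z}$ with probability distribution
$\mathbb{P}$, the mutual information between $Z_1$ and $Z_2$ is

$$I(Z_1; Z_2) = D(\mathbb{P}_{Z_2|Z_1} \| \mathbb{P}_{Z_2} | \mathbb{P}_{Z_1}).$$

Given a random triple $(Z_1, Z_2, Z_3) \in \mathsf{Z}_1 \times \mathsf{Z}_2 \times \mathsf{Z}_3$, the
conditional mutual information between $Z_2$ and $Z_3$ given $Z_1$
is

$$\begin{aligned} I(Z_2; Z_3 | Z_1) &= D(\mathbb{P}_{Z_2,Z_3|Z_1} \| \mathbb{P}_{Z_2|Z_1} \otimes \mathbb{P}_{Z_3|Z_1} | \mathbb{P}_{Z_1}) \\ &= D(\mathbb{P}_{Z_3|Z_1,Z_2} \| \mathbb{P}_{Z_3|Z_1} | \mathbb{P}_{Z_1,Z_2}), \end{aligned} \qquad (4)$$

where the second equality follows from Bayes' rule and from (3). In other
words, the conditional mutual information $I(Z_2; Z_3 | Z_1)$ is
given by the conditional divergence between the joint distribution
of $Z_1, Z_2, Z_3$ and the distribution under which $Z_2$ and
$Z_3$ are conditionally independent given $Z_1$.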
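As a numerical illustration for finite alphabets, the sketch below computes the mutual information from a joint pmf, and the conditional mutual information as the average, over the conditioning variable, of the mutual information of the conditional joint law (the first expression for the conditional mutual information above). The function names and the example distribution are assumptions made for illustration.

```python
import math

def mutual_information(joint):
    """I(Z1;Z2) in nats, where joint[i][j] = P(Z1 = i, Z2 = j)."""
    p1 = [sum(row) for row in joint]
    p2 = [sum(col) for col in zip(*joint)]
    info = 0.0
    for i, row in enumerate(joint):
        for j, pij in enumerate(row):
            if pij > 0.0:
                info += pij * math.log(pij / (p1[i] * p2[j]))
    return info

def conditional_mutual_information(joint3):
    """I(Z2;Z3|Z1), where joint3[i][j][k] = P(Z1 = i, Z2 = j, Z3 = k)."""
    total = 0.0
    for slab in joint3:
        weight = sum(sum(row) for row in slab)            # P(Z1 = i)
        if weight > 0.0:
            conditional = [[x / weight for x in row] for row in slab]
            total += weight * mutual_information(conditional)
    return total

# Assumed example: given Z1 = 0, Z2 and Z3 are independent fair bits;
# given Z1 = 1, they are equal. Each value of Z1 has probability 1/2.
joint3 = [[[0.125, 0.125], [0.125, 0.125]],
          [[0.25, 0.0], [0.0, 0.25]]]
cmi = conditional_mutual_information(joint3)
```

Only the slice with Z1 = 1 contributes here, so the result is half of log 2 nats.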
II. Sequential optimization algorithms and their information-based complexity

The work of Nemirovski and Yudin [1] deals with fundamental
limitations of sequential optimization algorithms in the
real-number model of computation. The basic setting is as
follows. We have a class $\mathcal{F}$ of convex functions $f : \mathsf{X} \to \mathbb{R}$
on some compact convex domain $\mathsf{X} \subset \mathbb{R}^n$. We seek an
"optimal" algorithm that would solve the optimization problem
(1) with a given guarantee of accuracy regardless of which
$f \in \mathcal{F}$ were to be optimized. The algorithms of interest
operate by repeatedly querying an oracle for information
about the unknown objective $f$ at appropriately selected points
in $\mathsf{X}$ and then combining the accumulated information to
form a solution. The notion of optimality of an algorithm
pertains to the number of queries it makes before producing
a solution, without regard to the combinatorial complexity of
computing each query. In other words, we are interested in
the information-based complexity (IBC) [15], [16] of convex
optimization problems.

The theory of IBC is concerned with the intrinsic difficulty of
computational problems in terms of the minimum amount of 
information needed to solve every problem in a given class 
with a given guarantee of accuracy. The word "information" 
here does not refer to information in the sense of Shannon, 
but rather to what is known a priori about the problem being 
solved, as well as what an algorithm is allowed to learn during 
its operation. There are three aspects inherent in this notion of 
information — it is partial, noisy, and priced. Let us explain 
informally what these three terms mean in the context of 
optimization by means of a simple example. 

Let $\mathsf{X} = [0, 1]$, and consider the function class

$$\mathcal{F} = \left\{ f_\theta(x) = \tfrac{1}{2}|x - \theta|^2 : \theta \in \mathsf{X} \right\}. \qquad (5)$$

We wish to design an algorithm that minimizes every $f = f_\theta \in \mathcal{F}$
to a given accuracy $\varepsilon > 0$. At the outset, the only a priori
information available to the algorithm consists of the problem
domain $\mathsf{X}$, the function class $\mathcal{F}$, and the desired accuracy $\varepsilon$.
The algorithm is allowed to query the value and the derivative
of $f$ at any finite set of points $\{x_1, \ldots, x_T\} \subset \mathsf{X}$ before
arriving at a solution, which we denote by $x_{T+1}$. The queries
are answered by an oracle, i.e., a (possibly stochastic) device
that knows the function $f$ (or, equivalently, the parameter $\theta$)
and responds to any query $x \in \mathsf{X}$ with $Y(\theta, x, \omega) \in \mathbb{R}^2$, where
$\omega$ is a random element from some probability space $(\Omega, \mathcal{B}, \mathbb{P})$
that represents oracle noise. The random variable $Y(\theta, x, \omega)$ is
assumed to be a noisy observation of the pair $(f_\theta(x), f'_\theta(x))$.
For concreteness, let us suppose that

$$\begin{aligned} Y(\theta, x, \omega) &= Y(\theta, x, (W, Z)) \\ &= \left( f_\theta(x) + W,\; f'_\theta(x) + Z \right) \\ &= \left( \tfrac{1}{2}|x - \theta|^2 + W,\; x - \theta + Z \right), \end{aligned} \qquad (6)$$

where $W$ and $Z$ are an i.i.d. pair of $\mathcal{N}(0, \sigma^2)$ random
variables.
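A minimal Python sketch of this oracle may help fix ideas. The function name `quadratic_oracle`, the seed, and the value of the hidden parameter are illustrative assumptions; only the form of the response follows the noise model just described.

```python
import random

def quadratic_oracle(theta, x, sigma, rng):
    """Noisy first-order oracle for f_theta(x) = |x - theta|^2 / 2.

    Returns (f_theta(x) + W, f_theta'(x) + Z) with W, Z i.i.d. N(0, sigma^2).
    """
    w = rng.gauss(0.0, sigma)
    z = rng.gauss(0.0, sigma)
    return 0.5 * (x - theta) ** 2 + w, (x - theta) + z

rng = random.Random(0)   # fixed seed, for reproducibility only
theta = 0.3              # hidden parameter (illustrative value)

# Noiseless case: a single query at x = 0 returns (theta^2 / 2, -theta).
value, slope = quadratic_oracle(theta, 0.0, 0.0, rng)
recovered = -slope

# Noisy case: a single response is unreliable, but the average concentrates.
slopes = [quadratic_oracle(theta, 0.0, 1.0, rng)[1] for _ in range(20000)]
mean_slope = sum(slopes) / len(slopes)
```

In the noiseless case the parameter is read off from one response, whereas in the noisy case many responses have to be combined before the derivative at the query point is known to comparable precision.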

The interaction of the algorithm and the oracle takes place as
follows. Let $\{(W_t, Z_t)\}_{t=1}^\infty$ be an i.i.d. sequence. At time
$t = 1, 2, \ldots$, the algorithm computes the query $X_t$ as a function
of the past queries $X_\tau$, $1 \le \tau \le t - 1$, and the corresponding
oracle responses $Y_\tau = Y(\theta, X_\tau, (W_\tau, Z_\tau))$, $1 \le \tau \le t - 1$. At
time $t = 0$ the algorithm knows only that $f \in \mathcal{F}$; this represents
the a priori information. At time $t \ge 1$, the algorithm
acquires additional data $(X^t, Y^t) = ((X_1, Y_1), \ldots, (X_t, Y_t))$,
and so can refine its a priori information. At every time step,
the information is partial in the sense that there are (potentially
infinitely) many functions consistent with it, and it is also noisy
due to the presence of the additive disturbances $(W_\tau, Z_\tau)$.

Formally, for the example outlined above, an algorithm that
makes $T$ queries (or a $T$-step algorithm) is a tuple
$\mathcal{A} = \{A_t : \mathsf{X}^{t-1} \times \mathsf{Y}^{t-1} \to \mathsf{X}\}_{t=1}^{T+1}$, where $\mathsf{Y} = \mathbb{R}^2$, so that, for $1 \le t \le T$,
$X_t = A_t(X^{t-1}, Y^{t-1})$ is the query at time $t$, and
$X_{T+1} = A_{T+1}(X^T, Y^T)$ is the solution. We assume that information
is priced in the sense that the algorithm is charged some fixed
cost $c > 0$ for every query it makes. Thus, it is desired to keep
the number of queries to a minimum. With this in mind, we
can define the IBC for a given accuracy $\varepsilon > 0$ as

$$\mathrm{IBC}(\varepsilon) = \inf \left\{ T \ge 1 : \exists \mathcal{A} = \{A_t\}_{t=1}^{T+1} \text{ s.t. } \sup_{\theta \in \mathsf{X}} \left[ \mathbb{E} f_\theta(X_{T+1}) - f_\theta^* \right] \le \varepsilon \right\},$$

where the expectation is taken w.r.t. the noise process

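$\{(W_t, Z_t)\}_{t=1}^\infty$. For this particular problem it can be shown
that

$$\mathrm{IBC}(\varepsilon) = \begin{cases} 1, & \sigma^2 = 0,\ \varepsilon \in [0, 1/2) \\ \Theta(\sigma^2/\varepsilon), & \sigma^2 > 0,\ \varepsilon \in (0, 1/2) \\ 0, & \varepsilon \ge 1/2 \end{cases} \qquad (7)$$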

The first entry ($\sigma^2 = 0$, $0 \le \varepsilon < 1/2$) follows because
the algorithm can just query $x_1 = 0$, obtain the response
$y_1 = ((1/2)\theta^2, -\theta)$, and immediately compute $x_2 = \theta$; the
last entry ($\sigma^2 \ge 0$, $\varepsilon \ge 1/2$) follows because the maximum
value of any $f_\theta \in \mathcal{F}$ on $\mathsf{X}$ is at most $1/2$. The intermediate
regime ($\sigma^2 > 0$, $\varepsilon \in (0, 1/2)$) is more involved. The main
contribution of the present paper is a unified information-theoretic
framework for deriving lower bounds on the IBC of
arbitrary sequential algorithms for solving convex programming
problems.

A. Formal definitions 

The above discussion can be formalized as follows: 

Definition 1. A problem class is a triple $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$
consisting of the following objects:

1) A compact, convex problem domain $\mathsf{X} \subset \mathbb{R}^n$;

2) An instance space $\mathcal{F}$, which is a class of convex functions
$f : \mathsf{X} \to \mathbb{R}$;

3) An oracle $\mathcal{O} = (\mathsf{Y}, P)$, where $\mathsf{Y}$ is the oracle information
space and $P(dy|f, x)$, $dy \in \mathcal{B}_{\mathsf{Y}}$, $f \in \mathcal{F}$, $x \in \mathsf{X}$, is a
Markov kernel.¹

¹Recall that $\mathcal{F}$ is a subset of $C(\mathsf{X})$, the space of all continuous real-valued
functions on $\mathsf{X}$. Equipped with the usual sup norm, $C(\mathsf{X})$ is a separable
Banach space, so a Markov kernel from $\mathcal{F} \times \mathsf{X}$ into $\mathsf{Y}$ is well-defined.
Some restrictions must be imposed in order to exclude oracles
that are "too informative," an extreme example being $\mathsf{Y} = \mathcal{F} \times \mathsf{X}$
and $P(dy|f, x) = \delta_{(f,x)}(dy)$. One way to rule this out
is to require the oracle in question to be local [1]:

Definition 2. We say that an oracle $\mathcal{O}$ is local if for every
$x \in \mathsf{X}$ and every pair $f, f' \in \mathcal{F}$ such that $f = f'$ in some
open neighborhood of $x$, we have

$$P(dy|f, x) = P(dy|f', x), \qquad \forall dy \in \mathcal{B}_{\mathsf{Y}}.$$

It is easy to see that the oracle described right before the
definition is not local. Indeed, fix a point $x \in \mathsf{X}$ and consider
any two functions $f, f' \in \mathcal{F}$ that agree on some open neighborhood
of $x$, but are not equal outside this neighborhood.
Then $P(dy|f, x) = \delta_{(f,x)}(dy)$, but $P(dy|f', x) = \delta_{(f',x)}(dy)$,
which violates locality. Most oracles encountered in practice
are local (see, for instance, the examples in Section III).

To gain more insight into stochastic oracles, we can appeal
to the basic structural result for Markov kernels: If $\mathsf{Z}_1$ and $\mathsf{Z}_2$
are standard Borel spaces, then any Markov kernel $P(dz_2|z_1)$
from $\mathsf{Z}_1$ to $\mathsf{Z}_2$ can be realized in the form $Z_2 = \Phi(z_1, W)$,
where $W$ is a random variable uniformly distributed on $[0, 1]$
and $\Phi : \mathsf{Z}_1 \times [0, 1] \to \mathsf{Z}_2$ is a measurable mapping [13,
Lemma 3.22]. Thus, for any stochastic oracle $P(dy|f, x)$ we
can find a deterministic oracle $\psi : \mathcal{F} \times \mathsf{X} \to \mathsf{U}$ with some information
space $\mathsf{U}$ and a measurable mapping $\Phi : \mathsf{U} \times [0, 1] \to \mathsf{Y}$,
such that $P$ can be realized as

$$Y = \Phi(\psi(f, x), W) \qquad (8)$$

with $W$ as above. Thus, $P$ will be local in the sense of
Definition 2 whenever its "deterministic part" $\psi$ is local.
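For a kernel with finitely many outcomes, such a realization can be obtained by inverting the cumulative distribution function. The sketch below covers only this finite special case (the cited lemma is far more general); `phi`, `sample`, and `coin_kernel` are hypothetical names introduced for illustration, and the uniform variable is drawn from the half-open interval [0, 1).

```python
import random

def phi(probs, w):
    """Inverse-CDF map: return the outcome index whose
    cumulative-probability interval contains w, for w in [0, 1)."""
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if w < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def sample(kernel, z1, rng):
    """Draw Z2 ~ P(.|z1) as a deterministic function of z1 and a uniform W."""
    return phi(kernel(z1), rng.random())

def coin_kernel(z1):
    """Hypothetical kernel on {0, 1}: outcome 1 has probability z1."""
    return [1.0 - z1, z1]

rng = random.Random(1)
draws = [sample(coin_kernel, 0.2, rng) for _ in range(20000)]
frequency = sum(draws) / len(draws)   # empirical frequency of outcome 1
```

All randomness enters through the single uniform draw, so the same map applied to a different kernel input yields a sample from the corresponding conditional distribution.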

Next, we make the notion of an optimization algorithm
precise. In this paper, we deal only with deterministic algorithms,
although all the results can be easily extended to cover
randomized algorithms as well (cf. [1] for details):

Definition 3. A $T$-step algorithm for a given $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$ is
a sequence of mappings $\mathcal{A} = \{A_t : \mathsf{X}^{t-1} \times \mathsf{Y}^{t-1} \to \mathsf{X}\}_{t=1}^{T+1}$.
The set of all $T$-step algorithms for $\mathcal{P}$ will be denoted by
$\mathfrak{A}_T(\mathcal{P})$.

The interaction of any $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ with $\mathcal{O}$, shown in Figure 1,
is described recursively as follows:

1) At time $t = 0$, a problem instance $f \in \mathcal{F}$ is selected by
Nature and revealed to $\mathcal{O}$, but not to $\mathcal{A}$.

2) At each time $t = 1, 2, \ldots, T$:

• $\mathcal{A}$ queries $\mathcal{O}$ with $X_t = A_t(X^{t-1}, Y^{t-1})$, where
$(X_\tau, Y_\tau) \in \mathsf{X} \times \mathsf{Y}$ is the algorithm's query and the
oracle's response at time $\tau \le t - 1$.

• $\mathcal{O}$ responds with a random element $Y_t \in \mathsf{Y}$ according
to $P(dY_t|f, X_t)$.

3) At time $t = T + 1$, $\mathcal{A}$ outputs the candidate minimizer
$X_{T+1} = A_{T+1}(X^T, Y^T)$.
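The three steps above can be sketched as a short simulation loop. This is an illustration under assumptions, not the paper's construction: the names `run`, `make_oracle`, and `gradient_step`, the quadratic instance, and the step-size rule are all hypothetical choices. The algorithm is a deterministic function of the history, and the instance is visible only to the oracle.

```python
import random

def run(algorithm, oracle, T):
    """Run a T-step algorithm against an oracle; return the candidate minimizer."""
    xs, ys = [], []
    for t in range(1, T + 1):
        x_t = algorithm(t, xs, ys)   # query computed from past queries and responses
        y_t = oracle(x_t)            # response; the instance is hidden in the oracle
        xs.append(x_t)
        ys.append(y_t)
    return algorithm(T + 1, xs, ys)  # output after T queries

def make_oracle(theta, sigma, rng):
    """Noisy first-order oracle for f(x) = |x - theta|^2 / 2 (illustrative)."""
    def oracle(x):
        return (0.5 * (x - theta) ** 2 + rng.gauss(0.0, sigma),
                (x - theta) + rng.gauss(0.0, sigma))
    return oracle

def gradient_step(t, xs, ys):
    """Projected gradient step on [0, 1] with step size 1 / (t - 1)."""
    if t == 1:
        return 0.5                   # the first query uses no data
    _, slope = ys[-1]
    x_next = xs[-1] - slope / (t - 1)
    return min(1.0, max(0.0, x_next))

x_noiseless = run(gradient_step, make_oracle(0.3, 0.0, random.Random(0)), T=1)
x_noisy = run(gradient_step, make_oracle(0.3, 0.1, random.Random(0)), T=2000)
```

With this step size and no clipping, the output equals the hidden parameter minus the average of the derivative noise, so the noiseless run recovers the minimizer after one query while the noisy run only approaches it.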

We can view the set-up of Figure 1 as a discrete-time stochastic
dynamical system with an unknown "parameter" $f \in \mathcal{F}$,
input sequence $\{X_t\}$, and output sequence $\{Y_t\}$. The objective
is to drive the system as quickly as possible to an $\varepsilon$-minimizing
state, i.e., any $x \in \mathsf{X}$ such that $f(x) - f^* \le \varepsilon$, for every $f \in \mathcal{F}$.





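Fig. 1. Interaction of an algorithm $\mathcal{A}$ and an oracle $\mathcal{O}$. (Block diagram: the oracle $P(dY_t|f, X_t)$ outputs $Y_t$; the algorithm $A_{t+1}(X^t, Y^t)$ outputs $X_{t+1}$; a delay element sits in the loop.)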

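We are interested in the fundamental limits on the speed with
which this can be done. Defining the error of $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ on
$f \in \mathcal{F}$ by

$$\mathrm{err}_{\mathcal{A}}(T, f) \triangleq f(X_{T+1}) - \inf_{x \in \mathsf{X}} f(x) = f(X_{T+1}) - f^*,$$

we introduce the following definition: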

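Definition 4. Fix a problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$. For any
$r \ge 1$, $\varepsilon > 0$, and $\delta \in (0, 1)$, we define the $r$th-order $(\varepsilon, \delta)$-complexity
and the $\varepsilon$-complexity of $\mathcal{P}$, respectively, as

$$K^{(r)}_{\mathcal{P}}(\varepsilon, \delta) = \inf \left\{ T \ge 1 : \exists \mathcal{A} \in \mathfrak{A}_T(\mathcal{P}) \text{ s.t. } \sup_{f \in \mathcal{F}} \Pr\left( \mathrm{err}^r_{\mathcal{A}}(T, f) \ge \varepsilon \right) \le \delta \right\};$$

$$K^{(r)}_{\mathcal{P}}(\varepsilon) = \inf \left\{ T \ge 1 : \exists \mathcal{A} \in \mathfrak{A}_T(\mathcal{P}) \text{ s.t. } \sup_{f \in \mathcal{F}} \mathbb{E}\, \mathrm{err}^r_{\mathcal{A}}(T, f) \le \varepsilon \right\}.$$

When the underlying problem class $\mathcal{P}$ is clear from context,
we will write simply $K^{(r)}(\varepsilon, \delta)$ and $K^{(r)}(\varepsilon)$. Moreover, when
$r = 1$ we will simply write $K_{\mathcal{P}}(\cdot)$ or $K(\cdot)$.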

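The following is immediate from definitions (the proof is in
Appendix B):

Proposition 1. For any $\mathcal{P}$, $r \ge 1$, $\varepsilon > 0$, $\delta \in (0, 1)$,

$$K^{(r)}_{\mathcal{P}}(\varepsilon/\delta, \delta) \le K^{(r)}_{\mathcal{P}}(\varepsilon).$$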
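The inequality reflects a single application of Markov's inequality. The following is a sketch of that step, under the reading that $\mathrm{err}^r_{\mathcal{A}}$ denotes the $r$th power of the (nonnegative) optimization error and that $\mathcal{A}$ is a $T$-step algorithm whose worst-case expected $r$th-order error is at most $\varepsilon$; the complete argument is the one given in the appendix.

```latex
% Markov's inequality applied to the nonnegative random variable err^r_A(T, f).
% Assumption: sup over f of E[err^r_A(T, f)] is at most epsilon.
\Pr\left( \mathrm{err}^r_{\mathcal{A}}(T, f) \ge \frac{\varepsilon}{\delta} \right)
  \le \frac{\delta}{\varepsilon}\, \mathbb{E}\, \mathrm{err}^r_{\mathcal{A}}(T, f)
  \le \delta
  \qquad \text{for every } f \in \mathcal{F}.
```

Hence any algorithm that meets the expected-error requirement with $T$ queries also meets the $(\varepsilon/\delta, \delta)$ requirement with the same $T$, which gives the stated ordering of the two complexities.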

The complexities $K^{(r)}_{\mathcal{P}}(\varepsilon, \delta)$ and $K^{(r)}_{\mathcal{P}}(\varepsilon)$ capture the intrinsic
difficulty of sequential optimization over the problem class
$\mathcal{P}$ using any finite-step algorithm. However, most iterative
optimization algorithms used in practice (such as stochastic 
gradient descent) are not run for a prescribed finite number 
of steps. Instead, they are run for however many steps are 
necessary until a desired accuracy is reached. Moreover, the 
error of the successive candidate minimizers produced by 
such an algorithm should decay monotonically with time. This 
observation motivates the following definitions: 

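Definition 5. A weak infinite-step algorithm for
$\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$ is a sequence of mappings
$\mathcal{A} = \{A_t : \mathsf{X}^{t-1} \times \mathsf{Y}^{t-1} \to \mathsf{X}\}_{t=1}^\infty$. The set of all weak infinite-step algorithms
for $\mathcal{P}$ will be denoted by $\mathfrak{A}_\infty(\mathcal{P})$.

Definition 6. Given a problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$ and some
$r \ge 1$, an algorithm $\mathcal{A} \in \mathfrak{A}_\infty(\mathcal{P})$ is $r$-anytime if

$$\overline{\mathrm{err}}^r_{\mathcal{A}}(t, \mathcal{F}) \triangleq \sup_{f \in \mathcal{F}} \mathbb{E}\, \mathrm{err}^r_{\mathcal{A}}(t, f) \to 0 \quad \text{as } t \to \infty. \qquad (9)$$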
We can now ask about fundamental limits on the rate of
convergence in (9):

Definition 7. For any problem class $\mathcal{P}$, we define the $r$-anytime
exponent as

$$\sup \left\{ \gamma \ge 0 : \exists \mathcal{A} \in \mathfrak{A}_\infty(\mathcal{P}) \text{ s.t. } \limsup_{t \to \infty} t^\gamma \cdot \overline{\mathrm{err}}^r_{\mathcal{A}}(t, \mathcal{F}) < \infty \right\}.$$


According to the above definitions, the candidate minimizer
$X_{t+1}$ produced by a weak infinite-step algorithm $\mathcal{A} \in \mathfrak{A}_\infty(\mathcal{P})$
after $t$ queries is simultaneously the query at time $t + 1$.
Many algorithms used in practice, such as stochastic gradient 
descent, are weak infinite-step algorithms. A more general 
class of algorithms, which we may call strong infinite-step al- 
gorithms, would also include strategies in which the process of 
issuing queries (i.e., gathering information about the objective) 
is separated from the process of generating candidate minimizers.
Stochastic gradient descent with trajectory averaging [3],
[17] is an example of such a strong algorithm. We do not
consider strong infinite-step algorithms in this paper (except 
for a brief discussion in Appendix E), although their study is
an interesting and important avenue for further research. 

III. Examples of problem classes and preview of selected results

The following six examples show the variety of settings cap- 
tured by our framework, ranging from "standard" optimization 
problems to such scenarios as parameter estimation, sequential 
experimental design, and active learning. 

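Example 1. Given $L > 0$, let $\mathcal{F}^L_{\mathrm{Lip}}$ be the set of all convex
functions $f : \mathsf{X} \to \mathbb{R}$ that are $L$-Lipschitz, i.e.,

$$|f(x) - f(y)| \le L\|x - y\|, \qquad \forall x, y \in \mathsf{X}.$$

Let $\mathsf{Y} = \mathbb{R} \times \mathbb{R}^n$ and let $P(dy|f, x)$ be a point mass
concentrated at $(f(x), g(x))$, where, for each $x \in \mathsf{X}$, $g(x)$
is an arbitrary subgradient in $\partial f(x)$. This oracle provides
noiseless first-order information. When $L = 1$, we will write
$\mathcal{F}_{\mathrm{Lip}}$ instead of $\mathcal{F}^1_{\mathrm{Lip}}$.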

Example 2. Take $\mathcal{F}^L_{\mathrm{Lip}}$ as above, but now suppose that the
oracle responds with

$$Y = (f(x) + W, g(x) + Z),$$

where $W \in \mathbb{R}$ and $Z \in \mathbb{R}^n$ are zero-mean random variables
with finite second moments. Thus, any algorithm receives
noisy first-order information, and the oracle is local.

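Example 3. Given $\kappa > 0$, let $\mathcal{F}^\kappa_{\mathrm{SC}}$ be the set of all differentiable
functions $f : \mathsf{X} \to \mathbb{R}$ that are $\kappa$-strongly convex, i.e.,

$$f(x) \ge f(y) + \nabla f(y)^T (x - y) + \frac{\kappa}{2}\|x - y\|^2, \qquad \forall x, y \in \mathsf{X}.$$

As in the previous example, the oracle responds with

$$Y = (f(x) + W, g(x) + Z),$$

where $W \in \mathbb{R}$ and $Z \in \mathbb{R}^n$ are zero-mean random variables
with finite second moments. When $\kappa = 1$, we will write $\mathcal{F}_{\mathrm{SC}}$
instead of $\mathcal{F}^1_{\mathrm{SC}}$.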



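Example 4. Fix a compact convex set $\mathsf{X} \subset \mathbb{R}^n$ and a family
of probability measures $\{P_\theta : \theta \in \mathsf{X}\}$ on $(\mathsf{Y}, \mathcal{B}_{\mathsf{Y}})$. Consider
the class of convex functions

$$\mathcal{F} = \{ f_\theta(x) : \theta \in \mathsf{X} \}, \qquad (10)$$

such that for every $\theta \in \mathsf{X}$, $f_\theta(\theta) = \min_{x \in \mathsf{X}} f_\theta(x)$. Consider
also the oracle $\mathcal{O} = (\mathsf{Y}, P)$, defined by

$$P(dy|f_\theta, x) = P_\theta(dy), \qquad \forall (\theta, x) \in \mathsf{X} \times \mathsf{X}. \qquad (11)$$

This oracle ignores the query $x$ and simply outputs a random
element $Y \sim P_\theta$. The problem class $(\mathsf{X}, \mathcal{F}, \mathcal{O})$ thus
describes the statistical problem of estimating the parameter
of a probability distribution. More generally, we can consider
the function class

$$\mathcal{F} = \left\{ f_\theta(\cdot) = \mathbb{E}_\theta[F(\cdot, Y)] = \int_{\mathsf{Y}} F(\cdot, y) P_\theta(dy) : \theta \in \mathsf{X} \right\}, \qquad (12)$$

where we assume that:

• For each fixed $y \in \mathsf{Y}$, the function $x \mapsto F(x, y)$ is convex

• $f_\theta(\theta) = \min_{x \in \mathsf{X}} f_\theta(x)$

The second condition says that $F : \mathsf{X} \times \mathsf{Y} \to \mathbb{R}$ is a contrast
function [18]. Most classical problems in statistical inference,
such as estimating the mean, the median, or the variance of
a distribution, can be cast as minimizing a convex contrast
function of the form (12). For instance, if $\mathsf{X} \subset \mathbb{R}$, $\mathsf{Y} = \mathbb{R}$,
$\mathbb{E}_\theta[Y] = \theta$ for each $\theta \in \mathsf{X}$, and $F(x, y) = (x - y)^2$, then

$$f_\theta(x) = \mathbb{E}_\theta\{(Y - x)^2\}$$

with $f_\theta(x) \ge f_\theta(\theta) = \mathrm{var}(P_\theta)$, so we recover the problem of
estimating the mean.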
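The mean-estimation instance can be checked numerically for a finitely supported distribution. The sketch below is only an illustration: the two-point distribution, the grid over the domain, and the function name `contrast_risk` are assumptions.

```python
def contrast_risk(x, values, probs):
    """Expected squared deviation E[(Y - x)^2] for a finitely supported Y."""
    return sum(p * (y - x) ** 2 for y, p in zip(values, probs))

# Assumed distribution: Y = 1 with probability 0.3, otherwise Y = 0.
values, probs = [0.0, 1.0], [0.7, 0.3]
theta = sum(p * y for y, p in zip(values, probs))                    # the mean
variance = sum(p * (y - theta) ** 2 for y, p in zip(values, probs))

grid = [i / 100 for i in range(101)]                                 # grid on [0, 1]
best_x = min(grid, key=lambda x: contrast_risk(x, values, probs))
```

The grid minimizer coincides with the mean, and the minimum value of the risk is the variance, in line with the quadratic contrast function described above.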

Example 5. As we have just seen, the queries are of no
use in statistical estimation since the samples the statistician
obtains depend only on the unknown parameter $\theta$. By contrast,
the setting in which the statistician's queries do affect the
observations is known as sequential experimental design [19]-[21].
Consider the case when $\mathsf{X} \subset \mathbb{R}^n$ is compact and convex,
as in the above example. Suppose also that we have two
families of probability measures on $\mathsf{Y}$, $\{Q_\theta : \theta \in \mathsf{X}\}$ and
$\{P_{\theta,x} : (\theta, x) \in \mathsf{X} \times \mathsf{X}\}$. The function class is as in (12), but
with $Q_\theta$ replacing $P_\theta$, while the oracle now is defined by

$$P(dy|f_\theta, x) = P_{\theta,x}(dy), \qquad \forall (\theta, x) \in \mathsf{X} \times \mathsf{X}.$$

Thus, the role of $Q_\theta$ is to provide a measure of performance (or
goodness-of-fit) of the final estimate of $\theta$, while $P_{\theta,x}$ describes
the experimental model (i.e., the relationship between the input
$X$ and the response $Y$ given the parameter $\theta$).

Example 6. Our last example is at the intersection of statistical
learning theory and sequential experimental design. Let

$$\mathsf{X} = [0, 1], \quad \mathsf{Y} = \{-1, +1\}, \quad \mathcal{F} = \{ f_\theta(x) = |x - \theta| : \theta \in \mathsf{X} \}.$$

To define the oracle, suppose that there exist some $0 < c, C < 1/2$
and $\kappa \in [1, \infty)$, such that

$$c|x - \theta|^{\kappa - 1} \le |P(Y = 1|f_\theta, x) - 1/2| \le C|x - \theta|^{\kappa - 1},$$

where the first inequality holds for all $x$ in a sufficiently small
neighborhood of $\theta$. This oracle provides a noisy subgradient
of $f_\theta$ at $x$, and the amount of noise depends on the distance
between $x$ and $\theta$. This problem class is related to active
learning of a threshold function on the unit interval [22], and
will be treated in detail in Section VI.
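One concrete oracle satisfying this two-sided condition can be sketched as follows. The sketch assumes that the bound holds with equality and a single constant, and that the label +1 is favored to the right of the threshold (matching the sign of the subgradient); these choices, like the names `label_probability` and `label_oracle`, are illustrative assumptions rather than the paper's definition.

```python
import random

def label_probability(x, theta, c, kappa):
    """Assumed form: P(Y = +1) = 1/2 + c * sign(x - theta) * |x - theta|^(kappa - 1)."""
    d = x - theta
    sign = (d > 0) - (d < 0)
    return 0.5 + c * sign * abs(d) ** (kappa - 1)

def label_oracle(x, theta, c, kappa, rng):
    """Return a label in {-1, +1} drawn according to label_probability."""
    return 1 if rng.random() < label_probability(x, theta, c, kappa) else -1

rng = random.Random(2)
labels = [label_oracle(0.8, 0.5, 0.25, 2.0, rng) for _ in range(20000)]
empirical = sum(1 for y in labels if y == 1) / len(labels)

near = abs(label_probability(0.51, 0.5, 0.25, 2.0) - 0.5)   # bias close to theta
far = abs(label_probability(0.80, 0.5, 0.25, 2.0) - 0.5)    # bias far from theta
```

For an exponent larger than one, the bias `near` is smaller than `far`: queries close to the threshold return labels that are closer to fair coin flips and therefore carry less information per query.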

We now briefly discuss some of the lower bounds that arise
from the techniques introduced in the paper. First, Theorem 1
in Section V implies a general lower bound of the form

$$\Omega\left( n^\alpha \log(1/\varepsilon) \right)$$

on the number of oracle calls required to $\varepsilon$-minimize every
function in a given class, where the exponent $\alpha > 0$ depends
on the geometry of the problem domain $\mathsf{X}$ and on the complexity
of the instance space $\mathcal{F}$. For convex Lipschitz functions and
noiseless first-order oracles (Example 1), or more generally for
stochastic oracles that are sufficiently "informative" in a sense
we make precise, this lower bound holds with $\alpha = 1$ (cf. the
discussion right after Theorem 1). This lower bound is known
to be optimal in the noiseless case [1] and in certain noisy
scenarios when $n = 1$ [23]; however, our techniques lead to a
much more transparent proof of the bound.

For the noisy first-order oracle with zero-mean Gaussian 
noise of variance a 2 , we obtain lower bounds of the form 



Here we do not aim at obtaining tight rates for specific settings 
of interest, but rather show the connections to the techniques 



n( 



l (l/e) a2 



where the exponent ct\ depends, as before, on the geometry 
of X, on the complexity of T, as well as on whether the 
oracle supplies full first-order information (function value and 
subgradient) or just the subgradient. The exponent depends 
on the details of the function class T . More specifically: 

• for J^ip (Example [2jl, we have c*2 = 2 (Theorem [2] in 
Section [V); 

• for Tsc (Example BJ, we have Q!2 = 1 (Theorem [3] in 
Section [V}. 

The corresponding result for convex Lipschitz functions in 
n = 1 can be found in (TJ, (8), yet we obtain the optimal 
dependence on n for higher dimensions. Our lower bound 
for strongly convex functions seems to be new; in particular, 
Nemirovski and Yudin (TJ only consider the noiseless case, 
while Agarwal et al. (9), JlO| consider noisy first-order oracles, 
but with a different oracle model, which does not allow 
additive noise due to a coin-tossing construction. Ignoring the 
dependence on the dimension, we also obtain the error decay 
rate fl(<j 2 /t) for Tsc when we restrict ourselves to anytime 
infinite-step algorithms (Theorem|5]in Section[VT|i. To the best 
of our knowledge, such analysis does not appear anywhere else 
in the literature. The bounds of Eq. |7]i essentially capture the 
fundamental limits of strongly convex programming in one 
dimension and can be easily deduced using our techniques (a 
sketch of the derivation is given in Section |IV-A) . We also 
derive new (and tighter) lower bounds on anytime algorithms 
for minimizing higher-order polynomials under a second- 
moment error criterion (Theorems [6] and [7] in Section |VI[ ). 

Apart from "standard" optimization problems, our frame- 
work seamlessly captures several statistical problems with an 



employed in statistics. Finally, we show in Section VI-C that 
our methodology leads to a particularly easy derivation of a 
lower bound for the active learning problem of Example [6] 
This bound was previously obtained in (22) using a much 
more involved argument relying on a careful construction of 
a "difficult" subset of functions. 

Overall, our main contributions are the development of a 
general framework that captures many diverse settings with 
optimization flavor, as well as a novel analysis that takes 
into account the effect of feedback upon the dynamics of the 
interaction between the algorithm and the oracle. 

IV. Setting the stage: optimization vs. hypothesis 

TESTING WITH FEEDBACK 

We now lay down the foundations of our information- 
theoretic method for determining lower bounds on the infor- 
mation complexity of convex programming. The basic strategy 
is to show that the minimum number of oracle queries is 
constrained by the average rate at which each new query can 
reduce the algorithm's uncertainty about the function being 
optimized. 

Conceptually, our techniques are akin to the ones used 
in statistical literature to obtain minimax lower bounds on 
the risks of estimation procedures [5|-[7|. The main idea is 
this. Given a problem class V = (X, T, O), we construct a 
"difficult" finite subclass T' = {/o, . . . , /jv-i} C T, such 
that the functions in it are nearly indistinguishable from one 
another based on the information supplied by the oracle in 
response to any possible query, and yet they are sufficiently 
far apart from one another, so that a candidate approximate 
minimizer for any one of them fails to minimize all the 
remaining functions to the same accuracy. Once such a class is 
constructed, we consider a fictitious situation in which Nature 
selects an element of T' uniformly at random. Then for every 
T-step algorithm A £ ^iriV) we can construct a probability 
space (il,B,P) with the following random variables defined 
on it: 

• M £ {0, . . . , N — 1}, which encodes the random choice 
of a problem instance in T 1 

• X T+1 £ X T+1 , where X T are the queries issued by A 
and Xt+i is the candidate minimizer 

• Y T £ Y T are the responses of O to the queries issued 
by A. 

These variables describe the interaction between Nature, the 
algorithm, and the oracle, and thus have the causal ordering 

M, Xi, Yi, . . . , X t , Yt, . . . , Xt, Yt, Xt+i, 
where, P-almost surely, 

P(M = i) = -J- (13) 



optimization flavor. In particular, in Section V-D we look at information-based complexity of statistical estimation and sequential experimental design (Examples 4 and 5, respectively).



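$$\mathbb{P}(X_t \in A \mid M, X^{t-1}, Y^{t-1}) = 1_{\{\mathcal{A}_t(X^{t-1}, Y^{t-1}) \in A\}}$$
$$\mathbb{P}(Y_t \in B \mid M, X^{t}, Y^{t-1}) = P(B \mid f_M, X_t),$$

for all $i \in \{0, \ldots, N-1\}$, $A \in \mathcal{B}_{\mathsf{X}}$, $B \in \mathcal{B}_{\mathsf{Y}}$. In other words, $M \to (X^{t-1}, Y^{t-1}) \to X_t$ and $(X^{t-1}, Y^{t-1}) \to (M, X_t) \to Y_t$ are Markov chains for every $t$.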






The reason for such punctilious bookkeeping is that now we can relate the problem faced by $\mathcal{A}$ to sequential hypothesis testing with feedback, as defined by Burnashev [24]. We can think of $M$ as encoding the choice of one of $N$ equiprobable hypotheses. At each time $t$, the algorithm issues a query $X_t$ and receives an observation $Y_t$ which is stochastically related to $X_t$ and $M$ via the kernel $W(dY_t \mid M, X_t) = P(dY_t \mid f_M, X_t)$. The current query may depend only on the past queries and observations. At time $T+1$, the algorithm produces a candidate minimizer, $X_{T+1}$. As we will shortly demonstrate, we can use the information available to $\mathcal{A}$ at time $T+1$ to construct an estimate $\hat{M}_T$ of the true hypothesis $M$. Once this is done, we can analyze the mutual information $I(M; \hat{M}_T)$, which is well-defined because we have specified $\mathbb{P}$. In particular, the analysis hinges on the following observations. Suppose that $\mathcal{A}$ is such that for some $r \ge 1$, $\epsilon > 0$, and $\delta \in (0, 1)$ we have

$$\Pr\big(\mathrm{err}^{(r)}_{\mathcal{A}}(T, f) \ge \epsilon\big) \le \delta, \qquad \forall f \in \mathcal{F},$$

where the probability is w.r.t. the randomness in the oracle's responses. Then, first of all,

$$\mathbb{P}\big(\mathrm{err}^{(r)}_{\mathcal{A}}(T, f_M) \ge \epsilon\big) \le \delta.$$

We will use this fact, together with the "geometric" distinguishability of the functions $\{f_i\}$, to show that $\mathbb{P}(\hat{M}_T \neq M) \le \delta$ and, as a consequence, that there exists some $\Psi_1(r, \epsilon, \delta) > 0$, such that

$$I(M; \hat{M}_T) \ge \Psi_1(r, \epsilon, \delta). \tag{14}$$



In other words, a good algorithm should be able to obtain a nontrivial amount of information about the hypothesis $M$. On the other hand, by the data processing inequality, $I(M; \hat{M}_T) \le I(M; X^T, Y^T)$, and we will use statistical indistinguishability of $\{f_i\}$, as well as the structure of the oracle, to obtain an upper bound of the form

$$I(M; \hat{M}_T) \le T\,\Psi_2(r, \epsilon) \tag{15}$$

with some $\Psi_2(r, \epsilon) < +\infty$. The two bounds are then combined to yield

$$T \ge \frac{\Psi_1(r, \epsilon, \delta)}{\Psi_2(r, \epsilon)}. \tag{16}$$



A. An illustrative example 

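To illustrate our method in action, we will sketch the derivation of the nontrivial part of the lower bound in (7), i.e., when $\epsilon \in (0, 1/2)$. Let

$$x_0^* = \begin{cases} 1/2 - \sqrt{2\epsilon}, & \epsilon \in (0, 1/8) \\ 0, & \epsilon \in [1/8, 1/2) \end{cases} \qquad x_1^* = \begin{cases} 1/2 + \sqrt{2\epsilon}, & \epsilon \in (0, 1/8) \\ 1, & \epsilon \in [1/8, 1/2). \end{cases}$$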

2 It is important to keep in mind that the hypothesis testing set-up is purely fictitious; indeed, $\mathcal{A}$ may or may not know that the problem instances are drawn at random among $\{f_0, \ldots, f_{N-1}\}$, rather than arbitrarily from the entire instance space $\mathcal{F}$. The point is, though, that the average performance of $\mathcal{A}$ on $\mathcal{F}'$ cannot be better than its worst-case performance on $\mathcal{F}$. In statistical terms, the minimax risk of $\mathcal{A}$ over $\mathcal{F}$ is bounded below by the Bayes risk over any subset of $\mathcal{F}$.



It is easy to see that $x_0^*, x_1^* \in [0, 1]$. Consider two functions

$$f_m(x) = \tfrac{1}{2}(x - x_m^*)^2, \qquad m \in \{0, 1\}.$$

A simple calculation shows that for any $x \in [0, 1]$ such that

$$f_0(x) - f_0^* = f_0(x) = \tfrac{1}{2}(x - x_0^*)^2 < \epsilon$$

we must have $f_1(x) - f_1^* = f_1(x) \ge \epsilon$, and the same holds with the roles of $f_0$ and $f_1$ reversed. Thus, any $\epsilon$-minimizer of $f_0$ fails to $\epsilon$-minimize $f_1$, and vice versa.
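The separation claim can be checked numerically. The following is a minimal sketch, restricted to $\epsilon \in (0, 1/8)$ and to a finite grid on $[0, 1]$; the function names and the grid size are illustrative choices, not part of the original construction.

```python
import math

def minimizers(eps):
    """Minimizers (x0*, x1*) of the two quadratics for eps in (0, 1/2)."""
    if eps < 1 / 8:
        shift = math.sqrt(2 * eps)
        return 0.5 - shift, 0.5 + shift
    return 0.0, 1.0

def f(x, x_star):
    """f_m(x) = (x - x_m*)^2 / 2, whose minimum value is 0."""
    return 0.5 * (x - x_star) ** 2

def jointly_eps_optimal(eps, grid=10000):
    """Grid points of [0, 1] that eps-minimize f_0 and f_1 simultaneously."""
    x0, x1 = minimizers(eps)
    points = (k / (grid - 1) for k in range(grid))
    return [x for x in points if f(x, x0) < eps and f(x, x1) < eps]

for eps in (0.001, 0.01, 0.1):
    print(eps, jointly_eps_optimal(eps))
```

For these values of $\epsilon$ the returned lists are empty: no grid point is an $\epsilon$-minimizer of both functions at once.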

On the other hand, the probability distribution of the output of the first-order Gaussian oracle (6) for any query $x \in [0, 1]$ when $M = 0$ is very close to its $M = 1$ counterpart. Indeed, letting $Y \in \mathbb{R}^2$ denote the output of the oracle, we have

$$P_{Y|M=m,X=x} = \mathcal{N}\big(\tfrac{1}{2}(x - x_m^*)^2, \sigma^2\big) \otimes \mathcal{N}\big(x - x_m^*, \sigma^2\big).$$

Then it is not hard to show that, for $m \in \{0, 1\}$,

$$D\big(P_{Y|M=m,X=x} \,\|\, P_{Y|M=1-m,X=x}\big) = O\Big(\frac{\epsilon}{\sigma^2}\Big). \tag{17}$$

In other words, the functions $f_0$ and $f_1$ are nearly indistinguishable from one another based on the outcome of a single query.

Now suppose that Nature selects an index $M \in \{0, 1\}$ uniformly at random. Consider a $T$-step algorithm that $\epsilon$-minimizes every function in the class $\mathcal{F}$ defined in (5) with probability at least $1 - \delta$, where $\delta \in (0, 1/2)$. Let $r = 1$. Then Lemma 1 in Section IV-C can be used to show that the lower bound (14) holds with

$$\Psi_1(1, \epsilon, \delta) = \log 2 - h_2(\delta) > 0,$$

where $h_2(\delta) = -\delta \log \delta - (1 - \delta)\log(1 - \delta)$ is the binary entropy function. On the other hand, using Lemmas 2 and 4, as well as Eq. (17), we can show that the upper bound (15) holds with

$$\Psi_2(1, \epsilon) = \max_{m \in \{0,1\}} \max_{x \in [0,1]} D\big(P_{Y|M=m,X=x} \,\|\, P_{Y|M=1-m,X=x}\big) = O\Big(\frac{\epsilon}{\sigma^2}\Big).$$

Hence, according to (16), any $T$-step algorithm that $\epsilon$-minimizes every function in the class $\mathcal{F}$ of Eq. (5) with probability at least $1 - \delta$ must satisfy

$$T = \Omega\Big(\frac{\sigma^2(\log 2 - h_2(\delta))}{\epsilon}\Big).$$

From this and from Proposition 1 we can obtain the lower bound $\Omega(\sigma^2/\epsilon)$ of Eq. (7). The matching upper bound $O(\sigma^2/\epsilon)$ is achieved by stochastic gradient descent [3].
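As a rough numerical companion to (17), the sketch below evaluates the divergence between the two Gaussian oracle outputs in closed form and maximizes it over a grid of queries. It assumes $\epsilon \in (0, 1/8)$; the helper names and the grid are hypothetical. Since both outputs are Gaussian with covariance $\sigma^2 I$, the divergence is the squared distance between the means divided by $2\sigma^2$.

```python
import math

def kl_first_order_gaussian(x, xa, xb, sigma):
    """KL divergence between the oracle outputs at query x for minimizers xa and xb."""
    df = 0.5 * (x - xa) ** 2 - 0.5 * (x - xb) ** 2   # gap in function values
    dg = (x - xa) - (x - xb)                          # gap in derivatives
    return (df ** 2 + dg ** 2) / (2 * sigma ** 2)

def psi2(eps, sigma, grid=1001):
    """Grid approximation of the worst-case divergence, for eps in (0, 1/8)."""
    shift = math.sqrt(2 * eps)
    x0, x1 = 0.5 - shift, 0.5 + shift
    queries = (k / (grid - 1) for k in range(grid))
    return max(kl_first_order_gaussian(x, x0, x1, sigma) for x in queries)

eps, sigma = 0.01, 1.0
print(psi2(eps, sigma), 5 * eps / sigma ** 2)
```

In this sketch the maximum is attained at the endpoints of $[0, 1]$ and works out to $5\epsilon/\sigma^2$, which is one concrete instance of the $O(\epsilon/\sigma^2)$ scaling.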

B. Reduction to hypothesis testing with feedback 

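We now develop our information-theoretic methodology in the general setting of Section II-A.

Let us fix a problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$. To set up our analysis, we first endow the instance space $\mathcal{F}$ with a "distance" $d(\cdot, \cdot)$ that has the following property: for any $x \in \mathsf{X}$ and any $\epsilon > 0$,

$$d(f, g) \ge 2\epsilon \ \text{ and } \ f(x) < f^* + \epsilon \quad \Longrightarrow \quad g(x) \ge g^* + \epsilon. \tag{18}$$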
In other words, an $\epsilon$-minimizer of a function cannot simultaneously be an $\epsilon$-minimizer of a distant function. It is easy to construct a $d$ satisfying (18) for any particular class $\mathcal{F}$ of continuous functions, although such a $d$ need not be a metric. For example, if we consider the class

$$\mathcal{F}_\Theta = \{f_\theta(x) = \|x - \theta\| : \theta \in \Theta\}$$

for some $\Theta \subseteq \mathsf{X}$, then $d(f_\theta, f_{\theta'}) = \|\theta - \theta'\|$ satisfies (18). Indeed, $\|\theta - \theta'\| \ge 2\epsilon$ and $\|x - \theta\| < \epsilon$ imply $\|x - \theta'\| > \epsilon$ by the triangle inequality. For a general $\mathcal{F}$, we can also define

$$d(f, g) \triangleq \inf_{x \in \mathsf{X}} [f(x) + g(x)] - [f^* + g^*],$$

the distance-like function introduced in [9], [10]. This definition coincides with $d(f_\theta, f_{\theta'}) = \|\theta - \theta'\|$ for the parametric set $\mathcal{F}_\Theta$; however, (18) is the most general requirement. Note that we will often implicitly restrict our consideration to a subclass of $\mathcal{F}$ and define an appropriate $d$ on that subclass.
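To make the distance-like function concrete, here is a small sketch for the one-dimensional parametric class $f_\theta(x) = |x - \theta|$ on $\mathsf{X} = [0, 1]$. The infimum is replaced by a minimum over a finite grid, and the particular parameter values are arbitrary illustrative choices.

```python
def make_f(theta):
    """f_theta(x) = |x - theta|, a one-dimensional member of the parametric class."""
    return lambda x: abs(x - theta)

def distance(f, g, points):
    """d(f, g) = inf_x [f(x) + g(x)] - [f* + g*], with inf taken over a finite grid."""
    f_star = min(f(x) for x in points)
    g_star = min(g(x) for x in points)
    return min(f(x) + g(x) for x in points) - (f_star + g_star)

points = [k / 1000 for k in range(1001)]   # grid on X = [0, 1]
f, g = make_f(0.2), make_f(0.7)
d_fg = distance(f, g, points)

eps = 0.2   # any eps with d(f, g) >= 2 * eps
# Property (18): no point may be an eps-minimizer of both f and g (here f* = g* = 0).
violations = [x for x in points if f(x) < eps and g(x) < eps]
print(d_fg, violations)
```

Up to floating-point error, `d_fg` agrees with $|\theta - \theta'| = 0.5$, and the list of violations of (18) is empty.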

Let us fix the exponent $r \ge 1$ and consider any finite $\mathcal{F}' = \{f_0, \ldots, f_{N-1}\} \subset \mathcal{F}$, such that any two distinct $f_i, f_j \in \mathcal{F}'$ are at least $2\epsilon^{1/r}$ apart in $d(\cdot, \cdot)$. Given any $T \in \mathbb{N}$ and an algorithm $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$, we can now construct the probability space $(\Omega, \mathcal{B}, \mathbb{P})$, as described in the introduction to this section. Given $X_{T+1}$, the output of $\mathcal{A}$, we can define the "estimator"

$$\hat{M}_T(X^T, Y^T) \triangleq \operatorname*{arg\,min}_{i = 0, \ldots, N-1} \big[f_i(X_{T+1}) - f_i^*\big], \tag{19}$$

which simply selects that function in $\mathcal{F}'$ for which the error of $X_{T+1}$ is the smallest. Since $X_{T+1}$ is $\sigma(X^T, Y^T)$-measurable, the estimator $\hat{M}_T$ is indeed a function only of the information available to $\mathcal{A}$ after time $T$.
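The estimator (19) is straightforward to express in code. The sketch below assumes the finite subclass is given as a list of callables together with their optimal values; the tie-breaking rule (smallest index) is an arbitrary choice made here, since (19) does not specify one.

```python
def estimate_hypothesis(x_final, functions, optimal_values):
    """Index i minimizing f_i(x_final) - f_i*, as in (19); ties go to the smallest i."""
    errors = [f(x_final) - f_star for f, f_star in zip(functions, optimal_values)]
    return min(range(len(errors)), key=errors.__getitem__)

# Hypothetical subclass: f_i(x) = |x - theta_i| with f_i* = 0.
thetas = [0.1, 0.5, 0.9]
functions = [lambda x, th=th: abs(x - th) for th in thetas]
optimal_values = [0.0, 0.0, 0.0]
print(estimate_hypothesis(0.45, functions, optimal_values))
```

With the candidate minimizer 0.45, the smallest error belongs to the function centered at 0.5, so the estimate is index 1.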

C. Information bounds 

The main object of interest will be the mutual information 
I(M;Mt). We first show that any "good" T-step algorithm 
obtains a nonzero amount of information about M at the end 
of its operation: 

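Lemma 1. Fix some $r \ge 1$, $\delta \in (0, 1/2)$, and $\epsilon > 0$. Suppose $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ attains

$$\sup_{f \in \mathcal{F}} \Pr\big(\mathrm{err}^{(r)}_{\mathcal{A}}(T, f) \ge \epsilon\big) \le \delta. \tag{20}$$

Let $\mathcal{F}' \subset \mathcal{F}$ be a finite set $\{f_0, \ldots, f_{N-1}\}$ of functions, such that

$$d(f_i, f_j) \ge 2\epsilon^{1/r}, \qquad \forall i \neq j.$$

Let $M$ be uniformly distributed on $\{0, 1, \ldots, N-1\}$, and suppose that $\mathcal{A}$ is fed with the random problem instance $f_M \in \mathcal{F}'$. If $N \ge 4$, then the estimator $\hat{M}_T$ defined in (19) satisfies the bound

$$I(M; \hat{M}_T) \ge (1 - \delta)\log N - \log 2 > 0. \tag{21}$$

If $N = 2$, then

$$I(M; \hat{M}_T) \ge \log 2 - h_2(\delta) > 0, \tag{22}$$

where $h_2(\delta) = -\delta\log\delta - (1 - \delta)\log(1 - \delta)$ is the binary entropy function.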



Remark 1. In the sequel, we will consider only the cases when the set $\mathcal{F}'$ is either "rich", so that $N \ge 4$, or has only two elements, so $N = 2$.

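Proof: Consider an algorithm $\mathcal{A}$ with the claimed properties. Define, for each $i$, the event

$$E_i \triangleq \big\{\mathrm{err}^{(r)}_{\mathcal{A}}(T, f_i) \ge \epsilon\big\}.$$

We first show that the event $\{\hat{M}_T \neq i\}$ implies $E_i$. Indeed, if $E_i$ does not occur, then from the fact that $d(f_i, f_j) \ge 2\epsilon^{1/r}$ for all $j \neq i$ and from (18) we deduce that

$$f_j(X_{T+1}) - f_j^* \ge \epsilon^{1/r} > f_i(X_{T+1}) - f_i^*, \qquad \forall j \neq i,$$

so it must be the case that $\hat{M}_T = i$. Therefore,

$$\delta \ge \max_{i = 0, \ldots, N-1} \mathbb{P}(E_i \mid M = i) \ge \max_{i = 0, \ldots, N-1} \mathbb{P}(\hat{M}_T \neq i \mid M = i) \ge \mathbb{P}(\hat{M}_T \neq M).$$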

Now suppose that $N \ge 4$. Then we can invoke the following version of Fano's inequality [25]:

$$\mathbb{P}(\hat{M}_T \neq M) \ge 1 - \frac{I(M; \hat{M}_T) + \log 2}{\log N}.$$

Rearranging, we get (21). When $N = 2$, we use a stronger form of Fano's inequality (see, e.g., Section 2.10 in [26]):

$$h_2\big(\mathbb{P}(\hat{M}_T \neq M)\big) \ge \log 2 - I(M; \hat{M}_T).$$

Since $\delta \mapsto h_2(\delta)$ is monotone increasing on $[0, 1/2]$, we get $h_2(\delta) \ge \log 2 - I(M; \hat{M}_T)$. Rearranging, we get (22). ■
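The two bounds of Lemma 1 can be evaluated directly. The sketch below is only a calculator for the right-hand sides of (21) and (22), in nats; the function names are illustrative.

```python
import math

def binary_entropy(p):
    """h_2(p) in nats, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def information_lower_bound(N, delta):
    """Right-hand side of (21) for N >= 4, or of (22) for N = 2."""
    if not 0 < delta < 0.5:
        raise ValueError("delta must lie in (0, 1/2)")
    if N == 2:
        return math.log(2) - binary_entropy(delta)
    if N >= 4:
        return (1 - delta) * math.log(N) - math.log(2)
    raise ValueError("Lemma 1 covers N = 2 or N >= 4")

print(information_lower_bound(2, 0.1), information_lower_bound(256, 0.1))
```

Both values are strictly positive for every $\delta \in (0, 1/2)$, which is what makes the subsequent comparison with an upper bound on $I(M; \hat{M}_T)$ informative.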
On the other hand, the amount of information $I(M; \hat{M}_T)$ cannot be too large:

Lemma 2. Any estimator $\hat{M} : \mathsf{X}^T \times \mathsf{Y}^T \to \{0, \ldots, N-1\}$ [and, in particular, the estimator $\hat{M}_T$ defined in (19)] satisfies

$$I(M; \hat{M}) \le \sum_{t=1}^{T} I(M; Y_t \mid X^t, Y^{t-1}). \tag{23}$$

Remark 2. The terms $I(M; Y_t \mid X^t, Y^{t-1})$ have analogues in the literature on information-theoretic experimental design (see, e.g., [19], [20]). In that context, they represent the average reduction of uncertainty about the unknown variable $M$ after observing the experimental outcome $Y_t$ based on the design point $X_t = \mathcal{A}_t(X^{t-1}, Y^{t-1})$.



Proof: We have

$$I(M; \hat{M}) \le I(M; X^T, Y^T) \tag{24}$$
$$= \sum_{t=1}^{T} I(M; X_t, Y_t \mid X^{t-1}, Y^{t-1}) \tag{25}$$
$$= \sum_{t=1}^{T} \big[I(M; X_t \mid X^{t-1}, Y^{t-1}) + I(M; Y_t \mid X^t, Y^{t-1})\big] \tag{26}$$
$$= \sum_{t=1}^{T} I(M; Y_t \mid X^t, Y^{t-1}), \tag{27}$$

where (24) is a consequence of the data processing inequality; (25) and (26) use the chain rule; and (27) uses the fact that $M \to (X^{t-1}, Y^{t-1}) \to X_t$ is a Markov chain. ■
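The decomposition (24)-(27) can be verified exactly on a toy model. The sketch below is entirely hypothetical: two hypotheses, binary queries and responses, a query-dependent flip probability, and a deterministic adaptive policy. Because the policy is deterministic, the queries $X^t$ are functions of $Y^{t-1}$, so it suffices to work with the joint law of $(M, Y^T)$.

```python
import itertools
import math

FLIP = {0: 0.4, 1: 0.1}   # hypothetical noise level attached to each query

def policy(past_responses):
    """Deterministic adaptive rule: query 0 first, then repeat the last response."""
    return past_responses[-1] if past_responses else 0

def joint(T):
    """Joint pmf of (M, Y^T) for M uniform on {0, 1}; Y_t equals M unless flipped."""
    pmf = {}
    for m in (0, 1):
        for ys in itertools.product((0, 1), repeat=T):
            p = 0.5
            for t in range(T):
                flip = FLIP[policy(ys[:t])]
                p *= flip if ys[t] != m else 1 - flip
            pmf[(m, ys)] = p
    return pmf

def entropy_of(pmf, key):
    """Entropy (in nats) of the random variable key(m, ys)."""
    marginal = {}
    for (m, ys), p in pmf.items():
        marginal[key(m, ys)] = marginal.get(key(m, ys), 0.0) + p
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)

def step_information(pmf, t):
    """I(M; Y_t | Y^{t-1}), which equals I(M; Y_t | X^t, Y^{t-1}) in this model."""
    return (entropy_of(pmf, lambda m, ys: (m, ys[:t - 1]))
            + entropy_of(pmf, lambda m, ys: ys[:t])
            - entropy_of(pmf, lambda m, ys: (m, ys[:t]))
            - entropy_of(pmf, lambda m, ys: ys[:t - 1]))

def information(pmf, statistic):
    """I(M; statistic(Y^T))."""
    return (entropy_of(pmf, lambda m, ys: m)
            + entropy_of(pmf, lambda m, ys: statistic(ys))
            - entropy_of(pmf, lambda m, ys: (m, statistic(ys))))

T = 3
pmf = joint(T)
steps = [step_information(pmf, t) for t in range(1, T + 1)]
total = information(pmf, lambda ys: ys)                        # I(M; X^T, Y^T)
majority = information(pmf, lambda ys: int(sum(ys) > T / 2))   # I(M; M_hat)
print(steps, total, majority)
```

Here the per-step terms sum to $I(M; X^T, Y^T)$, and the majority-vote estimator, being a function of the responses, carries no more information about $M$ than that sum, in line with (23).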
D. Refinement of the upper bounds

Lemmas 1 and 2 are the two main elements of our approach. In order to apply them, we need to get a handle on the conditional mutual information terms on the right-hand side of (23). The following two lemmas, whose proofs can be found in Appendix B, give us just the right tools for that:

Lemma 3. Consider any estimator $\hat{M} : \mathsf{X}^T \times \mathsf{Y}^T \to \{0, \ldots, N-1\}$. Then, considering any realization of the oracle $\mathcal{O}$ in the form (8), we have the bound

$$I(M; \hat{M}) \le \sum_{t=1}^{T} I(U_t; Y_t),$$

where $U_t = \psi(f_M, X_t)$ is the output of the "deterministic part" of the oracle.

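Lemma 4. Consider any estimator $\hat{M} : \mathsf{X}^T \times \mathsf{Y}^T \to \{0, \ldots, N-1\}$. For any sequence of conditional probability measures $\{Q_{Y_t|X^t,Y^{t-1}}\}_{t=1}^{T}$ on $(\mathsf{Y}, \mathcal{B}_{\mathsf{Y}})$ satisfying the conditions

$$P_{Y_t|M,X^t,Y^{t-1}} \ll Q_{Y_t|X^t,Y^{t-1}}, \qquad t = 1, \ldots, T, \tag{28}$$

we have the bound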



$$I(M; \hat{M}) \le \sum_{t=1}^{T} D\big(P_{Y_t|M,X^t,Y^{t-1}} \,\big\|\, Q_{Y_t|X^t,Y^{t-1}} \,\big|\, P_{M,X^t,Y^{t-1}}\big). \tag{29}$$

Remark 3. By hypothesis on the behavior of the oracle, $(X^{t-1}, Y^{t-1}) \to (M, X_t) \to Y_t$ is a Markov chain. Hence, $P_{Y_t|M,X^t,Y^{t-1}}$ in (29) can be replaced with $P_{Y_t|M,X_t}$.

The key to using Lemma 4 is in the judicious choice of the "comparison" measures $Q_{Y_t|X^t,Y^{t-1}}$. In particular, we will use two different strategies of choosing the $Q$'s, which in turn lead to two different types of bounds:

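• Information Radius (IR) bound: This bound is useful for analyzing arbitrary finite-time algorithms. For each $t$, take $Q_{Y_t|X^t,Y^{t-1}} = Q_{Y_t|X_t}$ to be the mixture

$$Q_{Y_t|X_t} = \frac{1}{N}\sum_{i=0}^{N-1} P_{Y_t|M=i,X_t}.$$

Then, letting $M'$ denote an independent copy of $M$ and noting that $Q_{Y_t|X_t} = \mathbb{E}_{M'} P_{Y_t|M',X_t}$, we obtain

$$I(M; \hat{M}) \le \sum_{t=1}^{T} D\big(P_{Y_t|M,X_t} \,\big\|\, \mathbb{E}_{M'} P_{Y_t|M',X_t} \,\big|\, P_{M,X_t}\big)$$
$$\le \sum_{t=1}^{T} \mathbb{E}_{M'} D\big(P_{Y_t|M,X_t} \,\big\|\, P_{Y_t|M',X_t} \,\big|\, P_{M,X_t}\big) \qquad (30)$$
$$= \sum_{t=1}^{T} \mathbb{E}_{M,X_t}\mathbb{E}_{M'} D\big(P_{Y_t|M,X_t} \,\big\|\, P_{Y_t|M',X_t}\big)$$
$$\le T \max_{i,j} \sup_{x \in \mathsf{X}} D\big(P_{Y|M=i,X=x} \,\big\|\, P_{Y|M=j,X=x}\big), \qquad (31)$$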



where (30) follows from Jensen's inequality and convexity of the divergence. The use of the term "information radius" is inspired by an analogous concept in the theory
of information-based complexity p6) : the divergence 
$D(P_{Y|M=i,X=x} \,\|\, P_{Y|M=j,X=x})$ quantifies how close, in a statistical sense, the oracle's responses are for a given query point $x \in \mathsf{X}$ and a given pair $i, j$. Viewing the random variable $Y \sim P_{Y|M=i,X=x}$ as (stochastic, noisy) information about the function $f_i$ at the point $x$, we can interpret the quantity multiplying $T$ in (31) as a measure of ambiguity of this information. We use IR bounds in Section V.

• Lyapunov Function (LF) bound: This bound is useful for analyzing anytime algorithms. It relies on the idea that, with certain types of problem classes, the oracle responds with "pure noise" whenever the query point happens to hit upon a minimizer. In other words, there exists a probability measure $Q^*$ on $\mathsf{Y}$, such that

$$P(dy \mid f, x) = Q^*(dy) \qquad \text{if } x \in \operatorname*{arg\,min}_{x} f,$$

where $\operatorname{arg\,min}_x f = \{x \in \mathsf{X} : f(x) = f^*\}$. Moreover, for an anytime algorithm it is often the case that the conditional divergence $D(P_{Y_t|M,X_t} \,\|\, Q^*_{Y_t} \,|\, P_{M,X_t})$, where $Q^*_{Y_t}$ is an independent copy of $Q^*$, decreases with $t$, and hence can be thought of as a Lyapunov function for the problem at hand (in fact, Lyapunov functions of the divergence type have been used before to analyze the convergence of specific stochastic optimization algorithms [27]). This leads to the natural choice of $Q_{Y_t|X^t,Y^{t-1}} = Q^*_{Y_t}$ and to the bound

$$I(M; \hat{M}) \le \sum_{t=1}^{T} D\big(P_{Y_t|M,X_t} \,\big\|\, Q^*_{Y_t} \,\big|\, P_{M,X_t}\big). \tag{32}$$

We apply the LF bounds in Section VI to the study of anytime algorithms.
We should point out that the use of an auxiliary measure $Q$ has been pioneered by Yang and Barron [6] (with later refinements by Yang [28]) in the usual setting of statistical estimation from i.i.d. data; there, the relevant bound was of the form

$$I(M; Y^T) \le D\big(P_{Y^T|M} \,\big\|\, Q_{Y^T} \,\big|\, P_M\big) \le \max_{m} D\big(P_{Y^T|M=m} \,\big\|\, Q_{Y^T}\big)$$

(cf. [6, p. 1571]). Similarly, the "symmetrization trick" involving an independent copy of $M$ and an application of Jensen's inequality, as in Eq. (30) above, is used often in the statistics literature (cf. [5] and references therein). Our innovation consists in first performing a sequential decomposition of the mutual information $I(M; X^T, Y^T)$, carefully taking into account all the Markov structures that arise due to the causality constraints that must be obeyed by the algorithm, and then choosing an appropriate auxiliary measure for each time $t = 1, \ldots, T$.

V. Lower bounds for arbitrary algorithms 

We now apply the lemmas of the preceding section to the 
problem of deriving lower bounds on the information-based 
complexity of several problem classes. These bounds hold for 
arbitrary finite-step or infinite-step algorithms. 
A. A general information-theoretic lower bound

Our first bound applies to any problem class. However, this 
generality comes at a price: the bound is nontrivial (i.e., tight) 
only in certain cases. 

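Theorem 1. Consider a problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$, with any realization of the oracle $\mathcal{O}$ in the form (8). Given any $\eta > 0$, define the packing number

$$N(\mathcal{F}, d, \eta) \triangleq \max\big\{N \ge 1 : \exists f_0, \ldots, f_{N-1} \in \mathcal{F} \ \text{s.t.}\ d(f_i, f_j) \ge 2\eta,\ \forall i \neq j\big\}.$$

Then, for any $r \ge 1$, any $\epsilon$ such that $N(\mathcal{F}, d, \epsilon^{1/r}) \ge 4$, and any $\delta \in (0, 1/2)$, the following bounds hold:

$$K^{(r)}_{\mathcal{P}}(\epsilon, \delta) \ge \frac{1}{C^*}\Big[(1 - \delta)\log N(\mathcal{F}, d, \epsilon^{1/r}) - \log 2\Big] \tag{33}$$
$$K^{(r)}_{\mathcal{P}}(\epsilon) \ge \frac{1}{C^*}\Big[\tfrac{2}{3}\log N(\mathcal{F}, d, (3\epsilon)^{1/r}) - \log 2\Big] \tag{34}$$

with

$$C^* = \sup I(U; Y),$$

where the supremum is over all random variables $U$ taking values in $\mathsf{U}_{\mathsf{X},\mathcal{F}} = \psi(\mathcal{F}, \mathsf{X})$, and the mutual information is between $U$ and $Y = \Phi(U, W)$, cf. Eq. (8).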

Remark 4. The number $C^*$ is the Shannon capacity of the random transformation $Y = \Phi(U, W)$ when its input is constrained to lie in the information space of the deterministic oracle $\psi$. When $C^* = \infty$, the bounds (33) and (34) simply say that $K^{(r)}_{\mathcal{P}}(\epsilon, \delta) \ge 0$ and $K^{(r)}_{\mathcal{P}}(\epsilon) \ge 0$.

Proof: Let $\mathcal{F}' = \{f_0, \ldots, f_{N-1}\} \subset \mathcal{F}$, $N = N(\mathcal{F}, d, \epsilon^{1/r})$, be a maximal packing set in $\mathcal{F}$. Given $\delta \in (0, 1/2)$, consider any $T$ and any algorithm $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ such that $\sup_{f \in \mathcal{F}} \Pr\big(\mathrm{err}^{(r)}_{\mathcal{A}}(T, f) \ge \epsilon\big) \le \delta$. Then we can apply Lemma 1 to get

$$I(M; \hat{M}_T) \ge (1 - \delta)\log N - \log 2.$$

On the other hand, from Lemma 3 and the definition of $C^*$,

$$I(M; \hat{M}_T) \le \sum_{t=1}^{T} I(U_t; Y_t) \le T C^*.$$

Combining these two bounds, we get (33), while (34) follows after applying Proposition 1 with $\delta = 1/3$. ■
As an example, let $\mathsf{X} = B_\infty^n$ and $\mathcal{F} = \mathcal{F}_{\mathrm{Lip}}$ (cf. Example 1). Consider the case $r = 1$. Let $A_\epsilon$ be a maximal $2\epsilon$-packing of $\mathsf{X}$ in $\ell_2$. A simple volume counting argument shows that

$$|A_\epsilon| \ge v_n^{-1}(1/\epsilon)^n,$$

where $v_n = \mathrm{vol}(B_2^n)$. Consider the subclass of $\mathcal{F}_{\mathrm{Lip}}$ consisting of all functions of the form $f_\theta(x) = \|x - \theta\|$, $\theta \in A_\epsilon$. Then for any two distinct functions $f_\theta, f_{\theta'}$ we will have

$$d(f_\theta, f_{\theta'}) = \|\theta - \theta'\| \ge 2\epsilon,$$

so $N(\mathcal{F}_{\mathrm{Lip}}, d, \epsilon) \ge v_n^{-1}(1/\epsilon)^n$. Theorem 1 then gives the following lower bound for any noisy oracle with $C^* < +\infty$:

$$K_{\mathcal{P}}(\epsilon) = \Omega\Big(n \log\frac{1}{\epsilon}\Big).$$

For noiseless first-order oracles, the same lower bound follows from a binary search argument, and can be achieved using the (computationally infeasible) method of centers of gravity [1], [2]. In order to achieve this bound with a noisy oracle, an algorithm must pose queries that reduce the uncertainty by an amount that is independent of $\epsilon$. This is possible with certain kinds of oracles [22], [23], [29].
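For a feel of the numbers, the sketch below evaluates the right-hand side of (33) with $r = 1$, substituting the volume bound $v_n^{-1}(1/\epsilon)^n$ for the packing number (valid because the bound is increasing in $\log N$). The capacity value passed in is a hypothetical placeholder, since $C^*$ depends on the oracle.

```python
import math

def log_unit_ball_volume(n):
    """log vol(B_2^n) = (n/2) log(pi) - log Gamma(n/2 + 1)."""
    return 0.5 * n * math.log(math.pi) - math.lgamma(0.5 * n + 1)

def log_packing_lower_bound(n, eps):
    """log of v_n^{-1} (1/eps)^n, the volume bound on the packing size."""
    return n * math.log(1 / eps) - log_unit_ball_volume(n)

def query_lower_bound(n, eps, delta, capacity):
    """Right-hand side of (33) with r = 1, using the volume bound in place of log N."""
    log_n = log_packing_lower_bound(n, eps)
    if log_n < math.log(4):
        raise ValueError("the bound needs a packing of size at least 4")
    return ((1 - delta) * log_n - math.log(2)) / capacity

# capacity=1.0 nat per query is an assumed value for illustration only.
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, query_lower_bound(n=10, eps=eps, delta=0.1, capacity=1.0))
```

The output grows linearly in $\log(1/\epsilon)$ for fixed $n$, matching the $\Omega(n \log(1/\epsilon))$ behavior stated above.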



B. First-order oracles with Gaussian noise 

If the oracle provides noisy first-order information, the above logarithmic lower bound can be tightened significantly. We now present lower bounds for two problem classes, convex Lipschitz functions (cf. Example 2) and strongly convex functions (cf. Example 3), when the oracle supplies first-order information corrupted by additive white Gaussian noise. This is an oracle that, for a function $f$ and a query point $x$, responds with

$$Y = \big(f(x) + W,\ g(x) + Z\big), \tag{35}$$

where, as before, $g(x)$ is an arbitrary subgradient in $\partial f(x)$, and $W \sim \mathcal{N}(0, \sigma^2)$, $Z \sim \mathcal{N}(0, \sigma^2 I_n)$ are mutually independent. We will refer to this oracle as the first-order Gaussian (FOG) oracle. We will also consider the subgradient-only Gaussian (SOG) oracle $Y = g(x) + Z$. For simplicity, we will assume that the algorithm knows the structure of the deterministic selector mapping $(f, x) \mapsto g(x) \in \partial f(x)$, since this knowledge can only help. We will see that the $\epsilon$-complexities of these problem classes have polynomial dependence on $1/\epsilon$, but differ in their dependence on the problem dimension $n$. Special cases of these results for linear functions for $n = 1$ and $r = 1$ can be found, for example, in [8]. The exponent of $1/\epsilon$ in the bounds will, generally, depend on the smoothness of functions in $\mathcal{F}$. We remark that noise variance for each coordinate of the subgradient is a constant $\sigma^2$, implying that the expected squared $\ell_2$ norm of the noisy subgradient scales linearly with $n$. We shall keep this in mind when considering achievability of the lower bounds by specific algorithms. It is also straightforward to treat the case when $Z \sim \mathcal{N}(0, (\sigma^2/n) I_n)$, i.e., when $\sigma^2$ bounds the total (as opposed to per-coordinate) noise variance.

We begin by particularizing the IR bound (31) to the oracles under consideration (the proof is given in Appendix B):

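Lemma 5 (IR bounds for Gaussian oracles).

$$I(M; \hat{M}) \le \begin{cases} \dfrac{T}{2\sigma^2} \max_{i,j} \sup_{x \in \mathsf{X}} \big\{ |f_i(x) - f_j(x)|^2 + \|g_i(x) - g_j(x)\|^2 \big\} & \text{(FOG)} \\[2ex] \dfrac{T}{2\sigma^2} \max_{i,j} \sup_{x \in \mathsf{X}} \|g_i(x) - g_j(x)\|^2 & \text{(SOG)} \end{cases} \tag{36}$$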
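The bound (36) is easy to evaluate once function values and subgradients are tabulated. The sketch below replaces the supremum over $x$ by a maximum over a finite grid, so it only approximates the right-hand side (from below); the data layout and names are illustrative assumptions. It is exercised on the two quadratics of Section IV-A.

```python
import math

def ir_bound(T, sigma, values, grads, first_order=True):
    """Evaluate (36) with the supremum over x replaced by a maximum over a grid.

    values[i][k] is f_i(x_k); grads[i][k] is the subgradient g_i(x_k) as a list.
    """
    worst = 0.0
    for vi, gi in zip(values, grads):
        for vj, gj in zip(values, grads):
            for k in range(len(vi)):
                gap = sum((a - b) ** 2 for a, b in zip(gi[k], gj[k]))
                if first_order:
                    gap += (vi[k] - vj[k]) ** 2
                worst = max(worst, gap)
    return T * worst / (2 * sigma ** 2)

eps = 0.01
shift = math.sqrt(2 * eps)
centers = [0.5 - shift, 0.5 + shift]
xs = [k / 100 for k in range(101)]
values = [[0.5 * (x - c) ** 2 for x in xs] for c in centers]
grads = [[[x - c] for x in xs] for c in centers]

print(ir_bound(100, 1.0, values, grads, first_order=True))    # FOG
print(ir_bound(100, 1.0, values, grads, first_order=False))   # SOG
```

For this pair the FOG value is $T \cdot 5\epsilon/\sigma^2$ and the SOG value is $T \cdot 4\epsilon/\sigma^2$, so dropping the function-value channel reduces the bound only by a constant factor here.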



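We can now address the complexity of minimizing Lipschitz convex functions over a compact domain (cf. Example 2):

Theorem 2. Consider the problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}_{\mathrm{Lip}}, \mathcal{O})$ with a Gaussian oracle, where $\mathsf{X} \subset \mathbb{R}^n$ with $n \ge 16$. Define

$$s_{\mathsf{X}} = \max\{s > 0 : s B_\infty^n \subseteq \mathsf{X}\}.$$

Then for any $r \ge 1$, $\epsilon < \big(s_{\mathsf{X}}\sqrt{n/8}\big)^r$, and $\delta \in (0, 1/2)$, the following bounds hold:

$$K^{(r)}_{\mathcal{P}}(\epsilon, \delta) \ge \begin{cases} \dfrac{((1 - \delta)n - 8)\, n s_{\mathsf{X}}^2 \log 2}{128\,(n s_{\mathsf{X}}^2 + 1)} \cdot \dfrac{\sigma^2}{\epsilon^{2/r}} & \text{(FOG)} \\[2ex] \dfrac{((1 - \delta)n - 8)\, n s_{\mathsf{X}}^2 \log 2}{128} \cdot \dfrac{\sigma^2}{\epsilon^{2/r}} & \text{(SOG)} \end{cases} \tag{37}$$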

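Proof: By the Varshamov-Gilbert bound (see, e.g., Lemma 2.9 in [7]), there exists an $n/8$-packing of size $N \ge 2^{n/8} \ge 4$ of the binary cube $\{-1, +1\}^n$ in the Hamming distance. In other words, there exists a subset $\{\xi_0, \ldots, \xi_{N-1}\}$ of the vertices of $B_\infty^n$ with $N \ge 2^{n/8}$, such that

$$\|\xi_i - \xi_j\|^2 = 4\sum_{k=1}^{n} 1_{\{\xi_{i,k} \neq \xi_{j,k}\}} \ge \frac{n}{2}, \qquad \forall i \neq j. \tag{38}$$

Define the functions

$$f_i(x) \triangleq \frac{\epsilon^{1/r}}{s_{\mathsf{X}}}\sqrt{\frac{8}{n}}\,\|x - s_{\mathsf{X}}\xi_i\|, \qquad i = 0, 1, \ldots, N-1.$$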

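Since $\epsilon < (s_{\mathsf{X}}\sqrt{n/8})^r$ and $\{s_{\mathsf{X}}\xi_i\}_{i=0}^{N-1} \subset \mathsf{X}$, these functions lie in $\mathcal{F}_{\mathrm{Lip}}$, and each $f_i$ is uniquely minimized at $x_i^* = s_{\mathsf{X}}\xi_i$ with $f_i^* = 0$. Moreover, upon defining

$$d(f_i, f_j) \triangleq \epsilon^{1/r}\sqrt{\frac{8}{n}}\,\|\xi_i - \xi_j\|,$$

it follows from (38) and the triangle inequality that $d(f_i, f_j) \ge 2\epsilon^{1/r}$ for all $i \neq j$, and that this $d$ satisfies the condition (18) on the set $\{f_i\}_{i=0}^{N-1}$. Hence, if there exist some $T \in \mathbb{N}$ and an algorithm $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ that attains (20), we can apply Lemma 1 to obtain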



$$I(M; \hat{M}_T) \ge (1 - \delta)\log N - \log 2 \ge \frac{(1 - \delta)n - 8}{8}\log 2. \tag{39}$$



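Next we will bound $I(M; \hat{M}_T)$ from above. For any $x \in \mathsf{X}$ and any pair $i, j$ we have

$$|f_i(x) - f_j(x)|^2 \le \frac{8\epsilon^{2/r}}{n s_{\mathsf{X}}^2}\|s_{\mathsf{X}}\xi_i - s_{\mathsf{X}}\xi_j\|^2 = \frac{8\epsilon^{2/r}}{n}\|\xi_i - \xi_j\|^2 = \frac{32\epsilon^{2/r}}{n}\sum_{k=1}^{n} 1_{\{\xi_{i,k} \neq \xi_{j,k}\}} \le 32\epsilon^{2/r},$$

and any subgradient of $f_i$ at $x$ has $\ell_2$ norm not exceeding $(\epsilon^{1/r}/s_{\mathsf{X}})\sqrt{8/n}$. Hence, applying Lemma 5, we get

$$I(M; \hat{M}_T) \le \begin{cases} \dfrac{16\,T\epsilon^{2/r}\,(n s_{\mathsf{X}}^2 + 1)}{\sigma^2\, n s_{\mathsf{X}}^2} & \text{(FOG)} \\[2ex] \dfrac{16\,T\epsilon^{2/r}}{n \sigma^2 s_{\mathsf{X}}^2} & \text{(SOG)} \end{cases} \tag{40}$$

Combining (39) and (40) and rearranging, we get (37). ■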
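The packing step of the proof can be imitated numerically. The sketch below does not implement the Varshamov-Gilbert argument itself; it simply draws random sign vectors and keeps those at Hamming distance at least $n/8$ from all previously kept ones, which is an assumed stand-in that suffices for small $n$. It then checks the separation $\epsilon^{1/r}\sqrt{8/n}\,\|\xi_i - \xi_j\| \ge 2\epsilon^{1/r}$, taking the scaled distance between minimizers as the distance $d$ (an assumption of this sketch).

```python
import math
import random

def hamming(a, b):
    """Number of coordinates in which two sign vectors differ."""
    return sum(u != v for u, v in zip(a, b))

def random_packing(n, size, seed=0):
    """Keep random sign vectors at Hamming distance >= n/8 from all kept ones."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < size:
        cand = tuple(rng.choice((-1, 1)) for _ in range(n))
        if all(hamming(cand, xi) >= n / 8 for xi in kept):
            kept.append(cand)
    return kept

def min_separation(xis, eps, r):
    """Smallest eps^{1/r} * sqrt(8/n) * ||xi_i - xi_j|| over distinct pairs."""
    n = len(xis[0])
    scale = eps ** (1 / r)
    best = math.inf
    for i, a in enumerate(xis):
        for b in xis[i + 1:]:
            sq = sum((u - v) ** 2 for u, v in zip(a, b))   # equals 4 * Hamming distance
            best = min(best, scale * math.sqrt(8 * sq / n))
    return best

n, eps, r = 16, 0.01, 1
xis = random_packing(n, size=2 ** (n // 8))
print(min_separation(xis, eps, r), 2 * eps ** (1 / r))
```

The first printed number is at least as large as the second, confirming that the constructed functions are $2\epsilon^{1/r}$-separated in this sense.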

Upper bounds on stochastic gradient descent, an algorithm which only uses the subgradient information, for $r = 1$ are of the form $O(G^2 D_{\mathsf{X}}^2/\epsilon^2)$, where $G^2$ is an upper bound on the expected squared norm of the noisy gradient [3]. As we show below, this is matched by our lower bounds. Indeed, $G^2 \propto n\sigma^2$ for the additive Gaussian noise with variance $\sigma^2$. For the unit sphere we thus obtain $\Omega(n\sigma^2/\epsilon^2)$; for the unit hypercube we obtain $\Omega(n^2\sigma^2/\epsilon^2)$ for the SOG oracle:

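Corollary 1. Under the conditions of Theorem 2, we have

$$\mathsf{X} = \rho B_2^n \quad \Longrightarrow \quad K^{(r)}_{\mathcal{P}}(\epsilon) = \begin{cases} \Omega\Big(\dfrac{n\rho^2}{\rho^2 + 1} \cdot \dfrac{\sigma^2}{\epsilon^{2/r}}\Big) & \text{(FOG)} \\[2ex] \Omega\Big(n\rho^2 \cdot \dfrac{\sigma^2}{\epsilon^{2/r}}\Big) & \text{(SOG)} \end{cases}$$

$$\mathsf{X} = \rho B_\infty^n \quad \Longrightarrow \quad K^{(r)}_{\mathcal{P}}(\epsilon) = \begin{cases} \Omega\Big(\dfrac{n^2\rho^2}{n\rho^2 + 1} \cdot \dfrac{\sigma^2}{\epsilon^{2/r}}\Big) & \text{(FOG)} \\[2ex] \Omega\Big(n^2\rho^2 \cdot \dfrac{\sigma^2}{\epsilon^{2/r}}\Big) & \text{(SOG)} \end{cases}$$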

When the functions in $\mathcal{F}$ are strongly convex (cf. Example 3), rather than convex Lipschitz, the complexity of optimization will decrease:

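Theorem 3. Consider the problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}_{\mathrm{SC}}, \mathcal{O})$ with a first-order or gradient-only Gaussian oracle, where $\mathsf{X} \subseteq \mathbb{R}^n$ with $n \ge 16$. Then for any $r \ge 1$, $\epsilon < (n s_{\mathsf{X}}^2/16)^r$, $\delta \in (0, 1/2)$, the following bounds hold:

$$K^{(r)}_{\mathcal{P}}(\epsilon, \delta) \ge \begin{cases} \dfrac{((1 - \delta)n - 8)\log 2}{256\,(D_{\mathsf{X}}^2 + 1)} \cdot \dfrac{\sigma^2}{\epsilon^{1/r}} & \text{(FOG)} \\[2ex] \dfrac{((1 - \delta)n - 8)\log 2}{256} \cdot \dfrac{\sigma^2}{\epsilon^{1/r}} & \text{(SOG)} \end{cases} \tag{41}$$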



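Proof: Given $n$, construct the set $\{\xi_0, \ldots, \xi_{N-1}\} \subset \{-1, +1\}^n$ as in the proof of Theorem 2, and define the functions

$$f_i(x) \triangleq \frac{1}{2}\Big\|x - \sqrt{\frac{16\epsilon^{1/r}}{n}}\,\xi_i\Big\|^2, \qquad i = 0, 1, \ldots, N-1.$$

Since $\epsilon < (n s_{\mathsf{X}}^2/16)^r$,

$$\Big\{\sqrt{\frac{16\epsilon^{1/r}}{n}}\,\xi_i\Big\}_{i=0}^{N-1} \subset \sqrt{\frac{16\epsilon^{1/r}}{n}}\,B_\infty^n \subseteq s_{\mathsf{X}} B_\infty^n \subseteq \mathsf{X}.$$

Thus, the functions $f_i$ lie in $\mathcal{F}_{\mathrm{SC}}$, and each $f_i$ is uniquely minimized by $x_i^* = \sqrt{16\epsilon^{1/r}/n}\,\xi_i$ with $f_i^* = 0$. Moreover, upon defining

$$d(f_i, f_j) \triangleq \frac{1}{4}\|x_i^* - x_j^*\|^2,$$

it follows from (38) and the triangle inequality that $d(f_i, f_j) \ge 2\epsilon^{1/r}$ for all $i \neq j$, and that this $d$ satisfies (18) on the set $\{f_i\}_{i=0}^{N-1}$. Hence, if there is some $T \in \mathbb{N}$ and some $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ satisfying (20), we can apply Lemma 1 to obtain (39).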

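Now we will derive an upper bound on $I(M; \hat{M}_T)$. For any $x \in \mathsf{X}$ and any pair $i, j$, we have

$$|f_i(x) - f_j(x)| = \frac{1}{2}\,\big|\|x - x_i^*\| + \|x - x_j^*\|\big| \cdot \big|\|x - x_i^*\| - \|x - x_j^*\|\big| \le D_{\mathsf{X}}\|x_i^* - x_j^*\|,$$

where the last step uses the definition of $D_{\mathsf{X}}$ and the triangle inequality. Thus,

$$|f_i(x) - f_j(x)|^2 \le D_{\mathsf{X}}^2\|x_i^* - x_j^*\|^2 \le 64 D_{\mathsf{X}}^2\epsilon^{1/r}.$$

Also,

$$\|\nabla f_i(x) - \nabla f_j(x)\|^2 = \|x_i^* - x_j^*\|^2 \le 64\epsilon^{1/r}.$$

Therefore, applying Lemma 5, we get

$$I(M; \hat{M}_T) \le \begin{cases} \dfrac{32\,(D_{\mathsf{X}}^2 + 1)\,T\epsilon^{1/r}}{\sigma^2} & \text{(FOG)} \\[2ex] \dfrac{32\,T\epsilon^{1/r}}{\sigma^2} & \text{(SOG)} \end{cases} \tag{42}$$

Combining (39) and (42), we get (41). ■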

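Corollary 2. Under the conditions of Theorem 3, we have

$$\mathsf{X} = \rho B_2^n \quad \Longrightarrow \quad K^{(r)}_{\mathcal{P}}(\epsilon) = \begin{cases} \Omega\Big(\dfrac{n}{\rho^2 + 1} \cdot \dfrac{\sigma^2}{\epsilon^{1/r}}\Big) & \text{(FOG)} \\[2ex] \Omega\Big(\dfrac{n\sigma^2}{\epsilon^{1/r}}\Big) & \text{(SOG)} \end{cases}$$

$$\mathsf{X} = \rho B_\infty^n \quad \Longrightarrow \quad K^{(r)}_{\mathcal{P}}(\epsilon) = \begin{cases} \Omega\Big(\dfrac{n}{n\rho^2 + 1} \cdot \dfrac{\sigma^2}{\epsilon^{1/r}}\Big) & \text{(FOG)} \\[2ex] \Omega\Big(\dfrac{n\sigma^2}{\epsilon^{1/r}}\Big) & \text{(SOG)} \end{cases}$$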


C. Noisy oracles satisfying a moment bound 

Our information-theoretic technique can be used to give a simpler derivation of the lower bounds obtained by Nemirovski and Yudin [1, Ch. 5] for Lipschitz convex functions and noisy first-order oracles satisfying a certain moment constraint.

Let $\mathsf{X} = B_\infty^n$ and $\mathcal{F} = \mathcal{F}_{\mathrm{Lip}}$, and consider the class of all noisy first-order oracles whose output $Y = (V^0, V^1) \in \mathbb{R} \times \mathbb{R}^n$ satisfies the following two conditions:

• (C1) It is unbiased, i.e., $\mathbb{E}[V^0 \mid f, x] = f(x)$, $\mathbb{E}[V^1 \mid f, x] \in \partial f(x)$, $\forall f \in \mathcal{F}, x \in \mathsf{X}$.

• (C2) There exist constants $\alpha > 1$, $L > 0$, such that

$$\mathbb{E}\big[|V^0 - f(x)|^\alpha \mid f, x\big] \le L^\alpha, \qquad \mathbb{E}\big[\|V^1\|^\alpha \mid f, x\big] \le L^\alpha$$

for all $f \in \mathcal{F}$, $x \in \mathsf{X}$.

We will denote the class of all such oracles by $\Pi(\alpha, L)$.

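Theorem 4. There exists an oracle $\mathcal{O} \in \Pi(\alpha, L)$, such that the corresponding problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$ satisfies

$$K_{\mathcal{P}}(\epsilon, \delta) \ge \frac{\log 2 - h_2(\delta)}{c\,\epsilon^{\alpha/(\alpha-1)}\log 2} \tag{43}$$

for all $\epsilon \in (0, \min\{L/2^{1/\alpha}, 1\})$, $\delta \in (0, 1/2)$ with some $c = c(\alpha, L) > 0$.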

Proof: Define two functions $f_0(x) = -\xi^T x$ and $f_1(x) = \xi^T x$, where $\xi \in \mathbb{R}^n$ has all coordinates equal to $\epsilon/n$, and consider the following noisy oracle defined by Nemirovski and Yudin [1, p. 198]. Choose a constant $c > 0$ such that $c^{1-\alpha} \le \min\{L^\alpha/2, 1\}$, and let $p_{\epsilon,\alpha} = c\,\epsilon^{\alpha/(\alpha-1)}$. On the set $\mathcal{F} \setminus \{f_0, f_1\}$, this oracle acts noiselessly, while on the set $\{f_0, f_1\}$ it acts as follows: given $f_i$, $i \in \{0, 1\}$, and $x \in \mathsf{X}$, it outputs

$$Y = \begin{cases} (0, 0), & \text{with probability } 1 - p_{\epsilon,\alpha} \\ p_{\epsilon,\alpha}^{-1}\big(f_i(x), \nabla f_i(x)\big), & \text{with probability } p_{\epsilon,\alpha}. \end{cases}$$

It is an easy exercise to show that this oracle belongs to $\Pi(\alpha, L)$; moreover, on the set $\{f_0, f_1\}$ this oracle can be realized in the form (8) with $\psi(f_i, x) = (f_i(x), \nabla f_i(x))$, $i \in \{0, 1\}$.

Consider an algorithm $\mathcal{A}$ that achieves (20) with $r = 1$. Let $U_t = \psi(f_M, X_t)$. Then $I(U_t; Y_t \mid X^t, Y^{t-1}) \le I(U_t; Y_t \mid X_t)$ because $(X^t, Y^{t-1}) \to U_t \to Y_t$ is a Markov chain. Now, given $X_t = x_t$, $U_t$ can take only two values, namely $(-\xi^T x_t, -\xi)$ or $(\xi^T x_t, \xi)$. Thus, $H(U_t \mid X_t) \le \log 2$. Moreover, since the mutual information $I(A; B \mid C)$ is convex in $P_{B|A,C}$, we have

$$I(U_t; Y_t \mid X_t) \le p_{\epsilon,\alpha} H(U_t \mid X_t) \le p_{\epsilon,\alpha}\log 2.$$

Summing over the $T$ rounds and using Lemma 2, we get

$$I(M; \hat{M}_T) \le \sum_{t=1}^{T} I(U_t; Y_t \mid X_t) \le T c\,\epsilon^{\alpha/(\alpha-1)}\log 2.$$

From Lemma 1 we have $I(M; \hat{M}_T) \ge \log 2 - h_2(\delta)$. Combining these bounds and rearranging, we get (43). ■
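The last step of the proof amounts to dividing the information lower bound by the per-query information budget $p_{\epsilon,\alpha}\log 2$. The sketch below evaluates the resulting quantity; the constant `c` is treated as a given input (the proof constrains it through $\alpha$ and $L$), and the function names are illustrative.

```python
import math

def binary_entropy(p):
    """h_2(p) in nats for p in (0, 1)."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def moment_oracle_lower_bound(eps, delta, alpha, c):
    """(log 2 - h_2(delta)) / (p log 2) with p = c * eps^(alpha / (alpha - 1))."""
    p = c * eps ** (alpha / (alpha - 1))   # probability of an informative response
    if not 0 < p <= 1:
        raise ValueError("c * eps^(alpha/(alpha-1)) must be a probability")
    return (math.log(2) - binary_entropy(delta)) / (p * math.log(2))

for alpha in (1.5, 2.0):
    print(alpha, moment_oracle_lower_bound(eps=0.1, delta=0.1, alpha=alpha, c=1.0))
```

For $\alpha = 2$ the value scales as $\epsilon^{-2}$, and for $\alpha$ closer to 1 the exponent $\alpha/(\alpha - 1)$ grows, reflecting how heavier-tailed noise makes each query less informative.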
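The statement of Theorem 4 should be interpreted in the following sense (cf. also [1]): given $\mathsf{X}$ and $\mathcal{F}$ as above,

$$\sup_{\mathcal{O} \in \Pi(\alpha, L)} K_{(\mathsf{X}, \mathcal{F}, \mathcal{O})}(\epsilon) = \Omega\big(\epsilon^{-\alpha/(\alpha-1)}\big).$$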

Thus, we have a lower bound which is robust relative to $\Pi(\alpha, L)$. However, this bound is sharp only for $\alpha \in (1, 2]$ [1], [9], [10]: for $\alpha > 2$, the correct bound is $O(1/\epsilon^2)$. This can be easily seen from the results of the preceding section on the Gaussian first-order oracle.



D. Statistical estimation and sequential experimental design 

Finally, let us see how the problems of parametric statistical estimation (Example 4) and experimental design (Example 5) can be viewed through the lens of optimization.

Let us consider statistical estimation first. We will use the notation of Example 4. Typically, one considers the setting in which the statistician gets $T$ i.i.d. samples $Y_1, \ldots, Y_T$ drawn from some $P_\theta$, where $\theta$ is unknown. The quantity of interest is the minimax risk

$$R_T^* \triangleq \inf_{\hat{\theta}_T} \sup_{\theta} \mathbb{E}_\theta\big(f_\theta(\hat{\theta}_T) - f_\theta(\theta)\big),$$

where the infimum is over all measurable estimators $\hat{\theta}_T : \mathsf{Y}^T \to \mathsf{X}$, and $f_\theta$ is an element of an appropriate class $\mathcal{F}$ of loss functions, such as (10) or

Now, the output of any such estimator can be viewed as the final result of some algorithm $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$. Thus, we simply follow our general recipe and isolate a finite subset $\mathcal{F}' = \{f_{\theta_0}, \ldots, f_{\theta_{N-1}}\}$, such that, with a suitably defined "distance" $d(\cdot, \cdot)$ that satisfies (18), we have

$$d(f_{\theta_i}, f_{\theta_j}) \ge 2\epsilon^{1/r}, \qquad \forall i \neq j.$$
Suppose that we can arrange things in such a way that the cardinality of such an $\mathcal{F}'$ is independent of $\epsilon$, but may still depend on $n$ and $r$: $N = N(r, n)$ (this is possible in many cases, cf. the proofs of Theorems 2 and 3). Then we simply apply the IR bound to get

$$K^{(r)}_{\mathcal{P}}(\epsilon) = \Omega\Big(\frac{\log N(r, n)}{\max_{i,j} D(P_{\theta_i} \,\|\, P_{\theta_j})}\Big). \tag{44}$$

Assuming, as is often the case, that

$$\max_{i,j} D(P_{\theta_i} \,\|\, P_{\theta_j}) = C(r, n)\,\epsilon^{\gamma/r}$$

for some $C(r, n) > 0$, $\gamma > 0$, we can invert (44) and obtain the minimax lower bound

$$R_T^* = \Omega\bigg(\Big(\frac{\log N(r, n)}{C(r, n)\,T}\Big)^{r/\gamma}\bigg).$$



Similar considerations apply to sequential experimental design as well (Example 5), except there we have a design strategy $\{X_t : \mathsf{Y}^{t-1} \to \mathsf{X}\}_{t=1}^{T}$ and the estimator $\hat{\theta}_T : \mathsf{Y}^T \to \mathsf{X}$, where at time $t = 1, \ldots, T$ we choose the design point $X_t = X_t(Y^{t-1})$ and obtain a sample $Y_t \sim P_{\theta, X_t}$, and then at time $T + 1$ we process all the samples to get the estimate $X_{T+1} = \hat{\theta}_T(Y^T)$. The minimax risk is then

$$R_T^* = \inf \sup_{\theta} \mathbb{E}\big(f_\theta(\hat{\theta}_T) - f_\theta(\theta)\big).$$

The connection to optimization is even more apparent than in the estimation setting, and the IR bound technique yields

$$K^{(r)}_{\mathcal{P}}(\epsilon) = \Omega\Big(\frac{\log N(r, n)}{\max_{i,j} \sup_{x \in \mathsf{X}} D(P_{\theta_i, x} \,\|\, P_{\theta_j, x})}\Big).$$

Just as before, this bound can be inverted to get a lower bound on the minimax risk. Note, however, that what we have is a lower bound on the number of the design points needed to guarantee that the minimax risk is below $\epsilon$.

VI. Lower bounds for anytime algorithms 

Conceptually, our use of the IR bounds is akin to the methods used in statistics to obtain minimax lower bounds through local entropy estimates and a device like the Assouad lemma (cf. [5] and references therein). In both cases, in order to get the right rates it is essential to arrange things so that the size of the "packing" set $\mathcal{F}'$ is independent of $\epsilon$. However, one
drawback of the IR bounds is that they do not take into account 
the dynamics of the algorithm, pertaining to the manner in 
which its expected error evolves with time. Instead, we must 
use uniform, worst-case bounds on the uncertainty remaining 
after each successive oracle call. However, it could be argued 
on practical grounds that the only optimization algorithms that 
are of any value are the ones whose performance gradually and 
monotonically improves with time, as more and more queries 
are issued — that is, anytime algorithms. In this section, we 
show that the LF bounds can be used to track the evolution of 
the mutual information over time. As a consequence, we will 
be able to derive upper bounds on the anytime exponent for 
certain problem classes. 



We will show that the amount of information extracted by an 
anytime algorithm at each time step obeys a law of diminishing 
returns: as the queries X t approach the minimizer, the rate at 
which the algorithm can reduce its uncertainty about the objec- 
tive function slows down. Moreover, assuming that the worst- 
case expected error of such an algorithm decays polynomially 
with time, we will obtain lower bounds on the rate of this 
decay. We will also show that, in some cases, insisting on the 
anytime property may mean that the algorithm will take longer 
to get to the point after which its expected error drops below 
some desired level. This seemingly strange conclusion reflects 
the fact that, without placing any restrictions on the algorithm's 
trajectory, we are allowing "bizarre" (and not very practical) 
strategies that wander around the problem domain for a while, 
gathering information without much regard to how close they 
are to a minimizer, and then — boom! — produce an excellent 
solution. With such algorithms, it is certainly no surprise that 
they may hit upon a good solution more quickly than an 
"honest" anytime algorithm that must proceed incrementally 
and inexorably towards a minimizer. 

In contrast to the local technique based on the IR bounds, our use of the LF bounds in this section can be thought of as a global technique [6], [28], [30]. The main idea is as follows. Suppose we have an anytime algorithm whose worst-case expected errors decay at some rate $\epsilon_T \to 0$. Then, for each $T$, we consider an $\epsilon_T$-packing of the problem domain (with respect to a suitable metric, typically just the usual Euclidean norm $\|\cdot\|$), which will induce a packing of the function class $\mathcal{F}$. This packing will be of size $\Omega(\epsilon_T^{-n})$. Since the algorithm does well on every single function in $\mathcal{F}$, it must necessarily do well on every function in this large packing set. Thus, if the objective function is drawn uniformly at random from this set, then combining the lower bound of Lemma 1 with the LF upper bound will result in a relation of the form

$$n\log\frac{1}{\epsilon_T} \preceq \sum_{t=1}^{T} \epsilon_t^{\gamma},$$

where $\gamma > 0$ depends on the smoothness of the functions in $\mathcal{F}$. This relation must hold for all but finitely many values of $T$. The optimal rate is then derived by balancing the entropy $n\log\epsilon_T^{-1}$ and the sum of diminishing mutual information terms.



A. Strongly convex functions 

We first consider the case of strongly convex functions with 
Lipschitz-continuous gradients (an often made assumption [3], 
[27]). Given $\mathsf{X}$, let $\mathcal{F}_{\kappa,L}$ denote the set of all functions 
$f : \mathsf{X} \to \mathbb{R}$ that satisfy the following conditions: 

• Each $f \in \mathcal{F}_{\kappa,L}$ is $\kappa$-strongly convex (cf. Example 3). 

• For each $f$, the mapping $x \mapsto \nabla f(x)$ is $L$-Lipschitz: 

$$ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathsf{X}. $$

Consider the noisy first-order oracle $Y = (f(x) + W, \nabla f(x) + Z)$ 
as in (35). Let $\{f_0, \ldots, f_{N-1}\} \subset \mathcal{F}_{\kappa,L}$ be a finite set of 
functions such that $f_0^* = \ldots = f_{N-1}^* = c^*$ and $\nabla f_i(x_i^*) = 0$, 
where $x_i^*$ is the (unique) minimizer of $f_i$ on $\mathsf{X}$. Let $M$ denote 
the uniform random variable on $\{0, \ldots, N-1\}$. Then we have 
the following (the proof is in Appendix B): 



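Lemma 6. At every time $t = 1, 2, \ldots$, any algorithm 
$\mathcal{A} \in \mathfrak{A}_\infty(\mathcal{P})$ satisfies 

$$ I(M; Y_t | X^t, Y^{t-1}) \le \frac{(L/\kappa)^2 (D_{\mathsf{X}}^2 + 1)}{\sigma^2} \max_{i} \mathbb{E}\, \mathrm{err}_{\mathcal{A}}(t, f_i). \tag{45} $$

Moreover, if $\mathcal{A}$ is 1-anytime (cf. the definition of anytime 
algorithms), then 

$$ I(M; Y_t | X^t, Y^{t-1}) \le \frac{(L/\kappa)^2 (D_{\mathsf{X}}^2 + 1)}{\sigma^2}\, \mathrm{err}_{\mathcal{A}}(t, \mathcal{F}) \to 0 \quad \text{as } t \to \infty. \tag{46} $$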

Thus, the decay of the expected error in minimizing a strongly 
convex function is accompanied by the decay of the average 
information gain, and, moreover, the two quantities decay 
at the same rate. In other words, anytime algorithms for 
strongly convex programming obey a law of diminishing 
returns. Evidently, this phenomenon is due to the fact that, 
as the algorithm zeroes in on the minimizer, the signal-to- 
noise ratio keeps dropping because the mean-square error and 
the mean-square norm of the gradient both decrease as $O(\epsilon_t)$. 
Using Lemma 6 in conjunction with the information bounds 
of Section V, we establish the following upper bound on the 
anytime exponent of strongly convex programming problems: 

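Theorem 5. Consider the problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$ with 
$\mathsf{X} = B_2^n$, $\mathcal{F} = \mathcal{F}_{\kappa,L}$ with $\kappa = 1$ and $L > 1$, and the Gaussian 
first-order oracle (35). Then $\gamma_{\mathcal{P}} \le 1$. In other words, on this 
problem class, $O(t^{-1})$ is the optimal error decay rate for all 
1-anytime algorithms whose errors decay polynomially with $t$. 

Proof: Consider any algorithm $\mathcal{A} \in \mathfrak{A}_\infty(\mathcal{P})$ whose worst- 
case errors $\epsilon_t = \mathrm{err}_{\mathcal{A}}(t, \mathcal{F})$ satisfy 

$$ \limsup_{t \to \infty} t^{\gamma} \epsilon_t < \infty $$

for some $\gamma > 0$. In other words, $\epsilon_t = O(t^{-\gamma})$. By Markov's 
inequality, we have, for every $T$, 

$$ \sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}_{\mathcal{A}}(T, f) \ge 3\epsilon_T \big) \le \frac{\sup_{f \in \mathcal{F}} \mathbb{E}\, \mathrm{err}_{\mathcal{A}}(T, f)}{3 \epsilon_T} \le \frac{1}{3}. $$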


Let us fix some T, let At = {Oo, 
2 -packing set in X (w.r.t. | 



■ , 0/v-i} denote a maximal 
||), and define 

i = 0,...,N- 1. 



By volume counting, N > v n 1 (1/3£t)™^ 2 - We also have 



<Hfi,fj) = h\\9i-0 j \\ 



> 



I(M; M r ) > - lo 



6er- By Lemma [TJ 
1 



(47) 



where c„ = | log f g^g^a ■ On the other hand, applying 
Lemma |6j we obtain 



7(M; M T ) < 



1 T 

-E 

*=i 



(48) 



Combining (47 1 and (48 1, we see that the sequence {e t } must 
satisfy the following inequalities: 



t=i 



where $c'_n = \sigma^2 c_n/(n+1)$. From Lemma C.1 we therefore 
conclude that there exists an infinite subsequence of times 
$1 \le t_1 < t_2 < \cdots$, such that $\epsilon_{t_j} \ge c\, t_j^{-1}$ for some $c > 0$. 
Since $\epsilon_t = O(t^{-\gamma})$ by hypothesis, we must have $\gamma \le 1$. ■ 
The bound $\Omega(t^{-1})$ is tight and can be achieved by stochastic 
gradient descent [3]. Note that the methods of Section V can be 
used to explicitly identify the dependence of the lower bound 
on the problem dimension $n$. 
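
The $t^{-1}$ rate can be illustrated numerically. The following Python sketch is not part of the original argument; it assumes a one-dimensional quadratic objective $f(x) = \frac{1}{2}(x - \theta)^2$, unconstrained iterates, step size $1/t$, and Gaussian gradient noise, and the names `sgd_final_error` and `mean_scaled_error` are hypothetical. Under these assumptions the iterate after $T$ queries equals $\theta$ minus the average of the noise terms, so $T$ times the expected error should be close to $\sigma^2/2$.

```python
import random


def sgd_final_error(theta, sigma, T, rng):
    """Run SGD with step size 1/t on f(x) = 0.5 * (x - theta)**2.

    Each query returns the noisy gradient (x - theta) + Z with
    Z ~ N(0, sigma^2). Returns the optimization error f(x) - f* of
    the iterate produced after T queries.
    """
    x = 0.0
    for t in range(1, T + 1):
        noisy_grad = (x - theta) + rng.gauss(0.0, sigma)
        x -= noisy_grad / t
    return 0.5 * (x - theta) ** 2


def mean_scaled_error(T, runs=2000, theta=0.3, sigma=1.0, seed=0):
    """Monte Carlo estimate of T * E[error after T queries]."""
    rng = random.Random(seed)
    total = sum(sgd_final_error(theta, sigma, T, rng) for _ in range(runs))
    return T * total / runs


if __name__ == "__main__":
    for T in (50, 200):
        print(T, mean_scaled_error(T))
```

If the scaled error stays roughly constant as $T$ grows, the error itself decays like $1/T$, which is the behavior the lower bound says cannot be improved for 1-anytime algorithms.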

B. Comparison of IR and LF bounds 

A natural question is whether LF bounds for anytime 
algorithms provide tighter lower bounds when compared to 
IR bounds without the anytime assumption. This is indeed the 
case, as we demonstrate through an example. Interestingly, the 
difference is not present for linear and quadratic functions, but 
appears for higher degree polynomials. Consider a simple setup 
with $\mathsf{X} = [-1, 1]$, 

$$ \mathcal{F} = \mathcal{F}_m = \{ f_\theta(x) = (x - \theta)^m : \theta \in \mathsf{X} \} $$

for some even $m \in \mathbb{N}$, and the noisy first-order oracle 

$$ Y = (f(x) + W, f'(x) + Z), $$

where $W$, $Z$ are mutually independent $N(0, \sigma^2)$ random variables. 

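Theorem 6. On this problem class, for any $\epsilon < 1$ and any 
$\delta \in (0, 1/2)$ we have 

$$ K^{(2)}_{\mathcal{P}}(\epsilon, \delta) = \Omega\left( \frac{\sigma^2 \left[ \log 2 - h_2(\delta) \right]}{4^m m^4 \epsilon^{1/m}} \right). \tag{49} $$

Proof: Let us define 

$$ d(f_\theta, f_{\theta'}) \triangleq 2^{1-m} (\theta - \theta')^m, \quad \forall \theta, \theta' \in \mathsf{X}. \tag{50} $$

By convexity of $x \mapsto x^m$ (recall that $m$ is even), we have for 
any $x, \theta, \theta' \in \mathsf{X}$ 

$$ (\theta - \theta')^m = 2^m \left( \frac{\theta - x}{2} + \frac{x - \theta'}{2} \right)^m \le 2^{m-1} \left[ (x - \theta)^m + (x - \theta')^m \right]. $$

Hence, the $d(\cdot, \cdot)$ defined in (50) satisfies (18). In particular, 
if we fix the functions 

$$ f_0(x) = (x - \epsilon^{1/2m})^m \quad \text{and} \quad f_1(x) = (x + \epsilon^{1/2m})^m, $$

then $d(f_0, f_1) = 2\epsilon^{1/2}$. Let $M$ have uniform distribution on 
$\{0, 1\}$. Consider now any algorithm $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ that attains 
$\mathrm{err}^2_{\mathcal{A}}(T, f_\theta) < \epsilon$ with probability at least $1 - \delta$ for every $\theta \in \mathsf{X}$. 
Applying Lemma 1 we obtain 

$$ I(M; \widehat{M}_T) \ge \log 2 - h_2(\delta). \tag{51} $$

On the other hand, from Lemma 5 we have 

$$ I(M; \widehat{M}_T) \le \frac{T}{2\sigma^2} \sup_{x \in \mathsf{X}} \left\{ [f_0(x) - f_1(x)]^2 + [f_0'(x) - f_1'(x)]^2 \right\}. $$

From convexity, 

$$ [f_0(x) - f_1(x)]^2 = \left[ (x - \epsilon^{1/2m})^m - (x + \epsilon^{1/2m})^m \right]^2 \le m^2 4^m \epsilon^{1/m}. $$

Likewise, applying the mean-value theorem to the function 
$x \mapsto x^{m-1}$, we get 

$$ [f_0'(x) - f_1'(x)]^2 = m^2 \left[ (x - \epsilon^{1/2m})^{m-1} - (x + \epsilon^{1/2m})^{m-1} \right]^2 \le [m(m-1)]^2 4^{m-1} \epsilon^{1/m}. $$

Thus, we obtain the upper bound 

$$ I(M; \widehat{M}_T) = O\left( \frac{T\, 4^m m^4 \epsilon^{1/m}}{\sigma^2} \right). \tag{52} $$

Combining (51) and (52), we obtain (49). ■ 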
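
The two elementary bounds used in the proof of Theorem 6 can be spot-checked on a grid. The sketch below is an illustration only: the function name `check_bounds` and the grid size are hypothetical choices, and a finite grid check is of course not a proof. It evaluates $[f_0(x) - f_1(x)]^2$ and $[f_0'(x) - f_1'(x)]^2$ on $[-1, 1]$ and compares them with $m^2 4^m \epsilon^{1/m}$ and $[m(m-1)]^2 4^{m-1} \epsilon^{1/m}$, respectively.

```python
def check_bounds(m, eps, grid=2001):
    """Check both squared-difference bounds on a uniform grid over [-1, 1].

    f0(x) = (x - a)**m and f1(x) = (x + a)**m with a = eps**(1/(2m)).
    Returns True if both bounds hold at every grid point.
    """
    a = eps ** (1.0 / (2 * m))
    value_bound = m ** 2 * 4 ** m * eps ** (1.0 / m)
    deriv_bound = (m * (m - 1)) ** 2 * 4 ** (m - 1) * eps ** (1.0 / m)
    for k in range(grid):
        x = -1.0 + 2.0 * k / (grid - 1)
        value_diff = ((x - a) ** m - (x + a) ** m) ** 2
        deriv_diff = (m * ((x - a) ** (m - 1) - (x + a) ** (m - 1))) ** 2
        if value_diff > value_bound or deriv_diff > deriv_bound:
            return False
    return True


if __name__ == "__main__":
    for m, eps in ((2, 0.1), (4, 0.5), (6, 0.01)):
        print(m, eps, check_bounds(m, eps))
```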

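Theorem 7. Consider the same problem class. Then 
$\gamma_{\mathcal{P}} \le \frac{m}{m-1}$. 

Proof: As before, consider any algorithm $\mathcal{A} \in \mathfrak{A}_\infty(\mathcal{P})$ 
whose worst-case errors $\epsilon_t = \mathrm{err}^2_{\mathcal{A}}(t, \mathcal{F})$ satisfy 

$$ \limsup_{t \to \infty} t^{\gamma} \epsilon_t < \infty $$

for some $\gamma > 0$. Applying the same argument based on 
Markov's inequality as in the proof of Theorem 5, we see that, 
for any $T$, $\mathcal{A}$ attains $\mathrm{err}^2_{\mathcal{A}}(T, f_\theta) < 3\epsilon_T$ with probability at 
least 2/3 on every $f_\theta \in \mathcal{F}$. Given $T$, let $\Lambda_T = \{\theta_0, \ldots, \theta_{N-1}\}$ 
denote the largest finite subset of $\mathsf{X} = [-1, 1]$, such that 

$$ |\theta_i - \theta_j| \ge 2 (3\epsilon_T)^{1/2m}, \quad \forall i \ne j. $$

A simple counting argument shows that $N \ge (1/3\epsilon_T)^{1/2m}$. 
Moreover, the functions $f_i(x) = (x - \theta_i)^m$, $i = 0, \ldots, N-1$, 
satisfy $d(f_i, f_j) \ge 2\sqrt{3\epsilon_T}$ for $i \ne j$, where $d(\cdot, \cdot)$ is defined 
in (50). 

By Lemma 1, 

$$ I(M; \widehat{M}_T) \ge \frac{1}{3m} \log \frac{1}{\epsilon_T} - c_m, $$

where $c_m = \frac{1}{3} \log\left( 3^{1/m} \cdot 8 \right)$. We will now combine this lower 
bound with an appropriate LF bound. Let $Q^*$ denote the 
bivariate normal distribution $N(0, \sigma^2 I_2)$. Then, for every 
$i = 0, \ldots, N-1$ we have 

$$ Y^{(i)} = (f_i(\theta_i) + W, f_i'(\theta_i) + Z) = (W, Z) \sim Q^*. $$

Hence, applying the LF bound with $Q_Y = Q^*$ as in the proof 
of Lemma 6 we obtain 

$$
\begin{aligned}
I(M; \widehat{M}_T)
&\le \frac{1}{2\sigma^2} \sum_{t=1}^{T} \mathbb{E}\left\{ [f_M(X_t)]^2 + [f_M'(X_t)]^2 \right\} \\
&= \frac{1}{2\sigma^2} \sum_{t=1}^{T} \mathbb{E}\left\{ (X_t - \theta_M)^{2m} + m^2 (X_t - \theta_M)^{2(m-1)} \right\} \\
&\le \frac{1}{2\sigma^2} \sum_{t=1}^{T} \left\{ \mathbb{E}[f_M(X_t) - f_M^*]^2 + m^2 \left( \mathbb{E}[f_M(X_t) - f_M^*]^2 \right)^{\frac{m-1}{m}} \right\} \qquad (53) \\
&\le \frac{m^2 + 1}{2\sigma^2} \sum_{t=1}^{T} \left( \mathbb{E}[f_M(X_t) - f_M^*]^2 \right)^{\frac{m-1}{m}} \\
&= \frac{m^2 + 1}{2\sigma^2} \sum_{t=1}^{T} \left( \mathbb{E}\, \mathrm{err}^2_{\mathcal{A}}(t, f_M) \right)^{\frac{m-1}{m}} \\
&\le \frac{m^2 + 1}{2\sigma^2} \sum_{t=1}^{T} \epsilon_t^{\frac{m-1}{m}},
\end{aligned}
$$

where in (53) we have used the concavity of the function 
$u \mapsto u^{(m-1)/m}$. Therefore, we conclude that the sequence 
$\{\epsilon_t\}$ must satisfy 

$$ \frac{2\sigma^2}{3(m^3 + m)} \log \frac{1}{\epsilon_T} - \frac{2\sigma^2 c_m}{m^2 + 1} \le \sum_{t=1}^{T} \epsilon_t^{\frac{m-1}{m}} $$

for all sufficiently large $T$. Applying Lemma C.1 we conclude 
that there exists an infinite subsequence of times $t_1 < t_2 < \ldots$, 
such that $\epsilon_{t_j} \ge c\, t_j^{-m/(m-1)}$ for some constant $c > 0$. Since 
$\epsilon_t = O(t^{-\gamma})$ by hypothesis, we must have $\gamma \le \frac{m}{m-1}$. ■ 
For $m = 2$, the two results indicate the same order of 
complexity, $T \succeq \epsilon^{-1/2}$; however, for $m = 4$ and larger, the 
bounds differ, giving $T \succeq \epsilon^{-1/m}$ for arbitrary algorithms and 
$T \succeq \epsilon^{-(m-1)/m}$ for anytime algorithms, which is larger. We 
conclude that, in general, the LF bounding technique leads 
to tighter bounds for optimization algorithms which actually 
converge monotonically to the optimal solution. 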
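
To make the comparison concrete, the short Python sketch below tabulates the two query-complexity exponents quoted above, $1/m$ for arbitrary algorithms and $(m-1)/m$ for anytime algorithms. It is only a bookkeeping aid; the function name `query_exponents` is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def query_exponents(m):
    """Return (a, b) such that the two lower bounds read
    T >= eps**(-a) for arbitrary algorithms (IR bound) and
    T >= eps**(-b) for anytime algorithms (LF bound),
    up to constants, for the class f(x) = (x - theta)**m with even m.
    """
    if m < 2 or m % 2 != 0:
        raise ValueError("m must be an even integer >= 2")
    return Fraction(1, m), Fraction(m - 1, m)


if __name__ == "__main__":
    for m in (2, 4, 6, 8):
        arbitrary, anytime = query_exponents(m)
        print(f"m={m}: arbitrary {arbitrary}, anytime {anytime}")
```

The two exponents coincide only at $m = 2$; for larger even $m$ the anytime exponent approaches 1 while the exponent for arbitrary algorithms shrinks toward 0.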



C. Active learning 

Our technique for analyzing anytime optimization algo- 
rithms can also be used to give a particularly simple derivation 
of the minimax lower bound for active learning of a threshold 
function on the unit interval [22]. In general, active learning 
is more difficult than (convex) optimization. However, for the 
case below, we can apply the tools developed in this paper. The 
reason for including this example is twofold: first, to show that 
problems beyond convex optimization can be attacked with our 
information-theoretic method, and second to exhibit a problem 
with a noise model more complicated than those encountered 
so far in the paper. 
The active learning problem is stated as follows. We have a 
pair $(X, Z)$ of jointly distributed random variables $X \in \mathsf{X} = 
[0, 1]$ and $Z \in \{0, 1\}$, where the marginal distribution $P_X$ is 
uniform on $[0, 1]$, while the conditional distribution $P_{Z|X}$ is 
unknown. We do, however, have some prior knowledge about 
$P_{Z|X}$. Define $\eta(x) = \mathbb{E}[Z | X = x]$. Then we assume the 
following: 

• There exists some $\theta \in [0, 1]$, such that $\eta(x) \le 1/2$ for 
$x \le \theta$ and $\eta(x) \ge 1/2$ otherwise. In other words, the 
Bayes classifier $G^*(x) = 1_{\{\eta(x) \ge 1/2\}}$ for this problem 
[31] is of the form $G^*(x) = G_\theta(x) = 1_{\{x \ge \theta\}}$. 

• For some $0 < c < C < 1/2$ and $\kappa \in [1, \infty)$, we have 

$$ c |x - \theta|^{\kappa - 1} \le |\eta(x) - 1/2| \le C |x - \theta|^{\kappa - 1}, \tag{54} $$

where the first inequality (known as the Tsybakov noise 
condition [32]) holds for all $x$ in a sufficiently small 
neighborhood of $\theta$. 

Let $\Pi(\kappa, c, C)$ denote the class of all conditional probability 
distributions $P_{Z|X}$ satisfying these two conditions. We wish to 
determine the unknown threshold $\theta$ using an active strategy: 
at time $t$, we request a label $Z_t \in \{0, 1\}$ at a point $X_t \in \mathsf{X}$, 
chosen as a function of the history $(X^{t-1}, Z^{t-1})$. Given our 
query $X_t$, the label $Z_t$ is generated at random according to 
$P_{Z|X}(\cdot | X_t)$. At time $t$, the candidate classifier is $G_{X_t}(x) = 
1_{\{x \ge X_t\}}$. The performance of the strategy after $t$ time steps 
is measured by the excess risk relative to $G^*$: 

$$ R(G_{X_t}) - R(G^*) = \int_{[X_t, 1] \,\triangle\, [\theta, 1]} |2\eta(x) - 1| \, dx, \tag{55} $$

where $\triangle$ denotes symmetric difference between sets. (The 
risk of a classifier $G : \mathsf{X} \to \{0, 1\}$ is defined as $R(G) = 
\Pr(G(X) \ne Z)$, and the Bayes risk is $R(G^*) = \inf_G R(G)$.) 

Castro and Nowak [22] have shown that any active strategy 
will have excess risks of $\Omega(t^{-\kappa/(2\kappa-2)})$, and gave an explicit 
scheme that achieves the rate $O(t^{-\kappa/(2\kappa-2)})$. Their proof of 
the lower bound relies on an intricate construction of two 
distributions $P^{(0)}_{Z|X}, P^{(1)}_{Z|X} \in \Pi(\kappa, c, C)$ that are close in a 
statistical sense, but far apart in the sense of their Bayes risks. 
We now show that the same lower bound can be derived using 
our machinery without any careful function tuning. To that 
end, we will cast this problem in the optimization setting, as 
alluded to in Example 6. Let $\mathsf{X}$ and $\mathcal{F}$ be as described there, 
and associate to each $P_{Z|X} \in \Pi(\kappa, c, C)$ a noisy oracle with 
$\mathsf{Y} = \{-1, +1\}$ and $P(Y = 1 | f, x) = P(Y = 1 | \theta, x) = \eta(x)$. 
With this correspondence in place, we can now prove the 
following: 

Theorem 8. Let $\kappa \in (1, 2]$. Suppose that there exists an active 
learning strategy satisfying 

$$ \sup_{P_{Z|X} \in \Pi(\kappa, c, C)} \mathbb{E}\left[ R(G_{X_t}) - R(G^*) \right] = O(t^{-\gamma}) $$

for some $\gamma > 0$. Then $\gamma \le \kappa/(2\kappa - 2)$. Thus, $O(t^{-\kappa/(2\kappa-2)})$ is 
the optimal decay rate for all active learning strategies whose 
excess risks decay as $\mathrm{Poly}(t^{-1})$. If $\kappa = 1$, then the excess risk 
is $\Omega(2^{-6C^2 t})$.³ 

3 The exponent in this lower bound is not tight, since there exists a specific 
strategy that achieves the excess risk of 0(2 — c * lo s e / 2 ) when k = 1 |22|. 



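Proof: For each $\theta \in [0, 1]$, find some $P^{\theta}_{Z|X} \in \Pi(\kappa, c, C)$, 
such that the inequalities in (54) hold for all values of $x \in 
\mathsf{X}$. Given a candidate classifier $G_{X_t}$, consider the excess risk 
$R(G_{X_t}) - R(G_\theta)$. Assume for now that $\theta > X_t$. Then from 
(55) and (54) we get 

$$ R(G_{X_t}) - R(G_\theta) \ge 2c \int_{X_t}^{\theta} (\theta - x)^{\kappa - 1} \, dx = \frac{2c}{\kappa} (\theta - X_t)^{\kappa}. $$

The case $X_t > \theta$ is similar. Thus, the expected excess risk of 
any strategy at time $t$ can be bounded as 

$$ \mathbb{E}\left[ R(G_{X_t}) - R(G_\theta) \right] \ge (2c/\kappa)\, \mathbb{E}|X_t - \theta|^{\kappa}. \tag{56} $$

Now suppose we have a learning strategy whose worst-case 
excess risks decay at a prescribed rate $\{r_t\}$: 

$$ \sup_{P_{Z|X} \in \Pi(\kappa, c, C)} \mathbb{E}\left[ R(G_{X_t}) - R(G^*) \right] = r_t, \quad t = 1, 2, \ldots $$

Then from this and (56) we have that, for every $P^{\theta}_{Z|X}$, this 
strategy satisfies 

$$ \mathbb{E}|X_t - \theta|^{\kappa} \le \kappa r_t / 2c, \quad t = 1, 2, \ldots \tag{57} $$

Let $\epsilon_t = (3\kappa r_t / 2c)^{1/\kappa}$. Then using (57) and Markov's 
inequality, we see that for this strategy we must have 

$$ \sup_{\theta \in [0, 1]} \mathbb{P}\left( |X_t - \theta| \ge \epsilon_t \,\big|\, \theta \right) \le 1/3, \quad \forall t = 1, 2, \ldots $$

In other words, this active learning strategy gives rise to an 
optimization algorithm $\mathcal{A}$ for the problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$, 
where $\mathcal{O}$ is specified by $P(Y = 1 | \theta, x) = \mathbb{E}_\theta[Z | X = x]$, and 
there exists some $T_0 \ge 1$ such that $\Pr(\mathrm{err}_{\mathcal{A}}(t, f) \ge \epsilon_t) \le 1/3$ 
for all $t \ge T_0$. 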

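Now for each $T \ge T_0$ let $\Lambda_T = \{\theta_0, \ldots, \theta_{N-1}\}$ be a 
maximal $2\epsilon_T$-packing of $[0, 1]$. Simple counting shows that 
$N \ge 1/2\epsilon_T$. Consider the set $\{f_m = f_{\theta_m} : \theta_m \in \Lambda_T\} \subset 
\mathcal{F}$, and denote $\eta_m(x) = \mathbb{E}_{\theta_m}[Z | X = x]$. Then, in our usual 
notation, we have from Lemma 1 that 

$$ I(M; \widehat{M}_T) \ge \frac{2}{3} \log \frac{1}{2\epsilon_T} - \log 2. \tag{58} $$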
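
The counting step can be made concrete with a short Python sketch. It is an illustration under stated assumptions: the packing is taken to be the equally spaced grid $0, 2\epsilon, 4\epsilon, \ldots$ (one convenient maximal packing among many), logarithms are natural, and the names `packing_points` and `info_lower_bound` are hypothetical.

```python
import math
from fractions import Fraction


def packing_points(eps):
    """Grid 0, 2*eps, 4*eps, ... inside [0, 1].

    Consecutive points are exactly 2*eps apart, and no further point of
    [0, 1] is at distance >= 2*eps from all of them, so the set is a
    maximal 2*eps-packing. Exact fractions avoid rounding at the endpoint.
    """
    eps = Fraction(eps)
    points = []
    x = Fraction(0)
    while x <= 1:
        points.append(x)
        x += 2 * eps
    return points


def info_lower_bound(eps):
    """Right-hand side of (58): (2/3) * log(1 / (2 * eps)) - log 2 (in nats)."""
    return (2.0 / 3.0) * math.log(1.0 / (2.0 * float(eps))) - math.log(2.0)


if __name__ == "__main__":
    eps = Fraction(1, 10)
    pts = packing_points(eps)
    print(len(pts), float(1 / (2 * eps)), info_lower_bound(eps))
```

As $\epsilon$ shrinks, the number of packing points grows like $1/(2\epsilon)$, so the information an algorithm must accumulate grows logarithmically in $1/\epsilon$.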



Next we apply Lemma 2. To that end, let us inspect the terms 
$I(M; Y_t | X^t, Y^{t-1})$: 

$$
\begin{aligned}
I(M; Y_t | X^t, Y^{t-1})
&= I(M, X_t; Y_t | X^{t-1}, Y^{t-1}) - I(X_t; Y_t | X^{t-1}, Y^{t-1}) \\
&\le I(M, X_t; Y_t | X^{t-1}, Y^{t-1}) \le I(M, X_t; Y_t),
\end{aligned}
$$

where the first step uses the chain rule, the second is because 
mutual information is nonnegative, and the third is because 
$(X^{t-1}, Y^{t-1}) \to (M, X_t) \to Y_t$ is a Markov chain. Now 
we use the LF bound with $Q^*$ the uniform distribution on 
$\{-1, +1\}$. Then 

$$
\begin{aligned}
I(M, X_t; Y_t)
&\le D\left( P_{Y_t | M, X_t} \,\|\, Q^* \,\big|\, P_{M, X_t} \right) \\
&\le 4\, \mathbb{E}_{M, X_t}\left\{ \left( P(Y_t = 1 | M, X_t) - 1/2 \right)^2 \right\} \\
&= 4\, \mathbb{E}_{M, X_t}\left\{ |\eta_M(X_t) - 1/2|^2 \right\} \\
&\le 4C^2\, \mathbb{E}_{M, X_t} |X_t - \theta_M|^{2(\kappa - 1)},
\end{aligned}
\tag{59}
$$
where in the second step we used the fact that 

$$ d(p \| 1/2) \triangleq p \log 2p + (1 - p) \log[2(1 - p)] \le 4 (p - 1/2)^2 $$

for all $p \in [0, 1]$, and in the last step we used (54). Suppose 
first that $\kappa > 1$. Because $\kappa \le 2$, the function $x \mapsto x^{(2\kappa - 2)/\kappa}$ 
is concave, and we can write 

$$ \mathbb{E}|X_t - \theta_M|^{2(\kappa - 1)} \le \left( \mathbb{E}|X_t - \theta_M|^{\kappa} \right)^{2(\kappa - 1)/\kappa}. $$
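
The elementary inequality $d(p \| 1/2) \le 4(p - 1/2)^2$ invoked in the second step of (59) can be checked numerically. The following Python sketch evaluates the gap on a uniform grid, with the logarithm taken either in nats or in bits; the function names and the grid size are hypothetical, and a grid check is a sanity test rather than a proof.

```python
import math


def kl_to_half(p, base=math.e):
    """Binary divergence d(p || 1/2) = p log(2p) + (1-p) log(2(1-p)).

    Uses the convention 0 * log 0 = 0; `base` selects the logarithm base.
    """
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            total += q * math.log(2.0 * q, base)
    return total


def max_gap(base=math.e, grid=10001):
    """Largest value of d(p || 1/2) - 4 (p - 1/2)^2 over a grid on [0, 1].

    A non-positive result (up to rounding) means the inequality holds
    at every grid point.
    """
    worst = -math.inf
    for k in range(grid):
        p = k / (grid - 1)
        worst = max(worst, kl_to_half(p, base) - 4.0 * (p - 0.5) ** 2)
    return worst


if __name__ == "__main__":
    print(max_gap(math.e), max_gap(2))
```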

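Using this in conjunction with (57) and Lemma 2 we can 
bound the mutual information $I(M; \widehat{M}_T)$ as 

$$ I(M; \widehat{M}_T) \le 4C^2 \sum_{t=1}^{T} \left( \frac{\kappa r_t}{2c} \right)^{2(\kappa - 1)/\kappa} = \frac{4C^2}{3^{2(\kappa - 1)/\kappa}} \sum_{t=1}^{T} \epsilon_t^{2(\kappa - 1)}. \tag{60} $$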



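Combining (58) and (60), we have 

$$ \frac{3^{(\kappa - 2)/\kappa}}{2C^2} \log \frac{1}{\epsilon_T} - \frac{5 \cdot 3^{(\kappa - 2)/\kappa}}{4C^2} \log 2 \le \sum_{t=1}^{T} \epsilon_t^{2(\kappa - 1)}. $$

An inequality like this must hold for all $T \ge T_0$. Lemma C.1 
then states that there exists an infinite subsequence of times 
$1 \le t_1 < t_2 < \ldots$ such that $\epsilon_{t_j} = \Omega\big( t_j^{-1/(2\kappa - 2)} \big)$, or, 
equivalently, that $r_{t_j} = \Omega\big( t_j^{-\kappa/(2\kappa - 2)} \big)$. Since by hypothesis 
$r_t = O(t^{-\gamma})$, we must have $\gamma \le \kappa/(2\kappa - 2)$. 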

When $\kappa = 1$, from (59) we have $I(M, X_t; Y_t) \le 4C^2$ for 
all $t$. This, together with (58), gives 

$$ \frac{1}{6C^2} \log \frac{1}{\epsilon_T} - \frac{5}{12 C^2} \log 2 \le T, \quad \forall T \ge T_0, $$

which gives $\epsilon_T = \Omega(2^{-6C^2 T})$ and $r_T = \Omega(2^{-6C^2 T})$. ■ 

VII. Concluding remarks 

Sequential optimization algorithms operating in the pres- 
ence of uncertainty must be able to accumulate information 
in order to reduce uncertainty. As we have shown in this 
paper, there are fundamental limitations on the rate at which 
this uncertainty can be reduced, depending on the richness 
of the class of objective functions faced by the algorithm, 
the noisiness and the structure of the oracle that supplies 
information to the algorithm, and the manner in which the 
algorithm may approach the optimum (i.e., monotonically or 
not). In order to derive these fundamental limitations, we 
have developed a comprehensive information-theoretic ma- 
chinery that makes use of the fact (which we have proved) 
that the problem of sequential optimization is, in a certain 
sense, at least as hard as hypothesis testing with feedback 
(or with controlled observations). This observation then leads 
to quantitative estimates that relate the minimum number of 
oracle queries needed to achieve a given level of accuracy 
to the overall reduction of uncertainty about the objective 
function being optimized. The latter is measured by the mutual 
information between the random choice of the objective and 
the history of the algorithm's queries and the oracle's responses. 
Carefully taking into account all the Markovian structures that 
are imposed by the sequential and the adaptive nature of the 
algorithm, we can obtain different upper bounds on this mutual 
information. 



Using this machinery, we have derived tight lower bounds 
in several settings in optimization, both for arbitrary and for 
anytime optimization algorithms (in some cases improving 
upon existing results), and beyond, e.g., for experimental 
design and active learning. One promising direction for future 
work is to consider algorithms with query costs, i.e., when 
issuing each query incurs a cost that may depend on the query, 
and the goal is to balance the total cost of querying with the 
final optimization error. Recent work by Naghshvar and Javidi 
[33] considers a hypothesis testing problem of this kind by 
relating it to optimal stopping for a Markov decision process, 
and the techniques developed in that work may be useful for 
deriving information-theoretic lower bounds for optimization 
problems with query costs. 



Appendix A 

Finite-step vs. strong infinite-step algorithms 

As we pointed out in Section II, our definition of an infinite- 
step algorithm is somewhat restrictive, as it allows only the 
algorithms that use their most recently computed candidate 
minimizer as the next query. The following definition removes 
this restriction: 

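Definition A.1. A strong infinite-step algorithm for a problem 
class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$ is a sequence of mappings $\tilde{\mathcal{A}} = \{\tilde{A}_t : 
\mathsf{X}^{t-1} \times \mathsf{Y}^{t-1} \to \mathsf{X} \times \mathsf{X}\}_{t=1}^{\infty}$. The set of all strong infinite-step 
algorithms for $\mathcal{P}$ will be denoted by $\tilde{\mathfrak{A}}_\infty(\mathcal{P})$. 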

The interaction of any $\tilde{\mathcal{A}} \in \tilde{\mathfrak{A}}_\infty(\mathcal{P})$ with $\mathcal{O}$ is described 
recursively as follows: 

1) At time $t = 0$, a problem instance $f \in \mathcal{F}$ is selected by 
Nature and revealed to $\mathcal{O}$, but not to $\tilde{\mathcal{A}}$. 

2) At each time $t = 1, 2, \ldots$: 

• $\tilde{\mathcal{A}}$ computes 

$$ (X_t, \widehat{X}_t) = \tilde{A}_t(X^{t-1}, Y^{t-1}), $$

where $X_t$ and $\widehat{X}_t$ are, respectively, the query and 
the candidate minimizer at time $t$. 

• $\mathcal{O}$ responds with a random element $Y_t \in \mathsf{Y}$ according 
to $P(dY_t | f, X_t)$. 

In other words, both $\widehat{X}_t$, the candidate minimizer at time $t$, 
and $X_t$, the query at time $t$, are computed on the basis of all 
currently available data, i.e., $(X^{t-1}, Y^{t-1})$, yet the algorithm 
has more freedom, since at time $t + 1$ it can query the oracle 
with an arbitrary point, rather than just $\widehat{X}_t$. The error of $\tilde{\mathcal{A}}$ on 
$f \in \mathcal{F}$ at time $t$ is given by 

$$ \mathrm{err}_{\tilde{\mathcal{A}}}(t, f) = f(\widehat{X}_t) - \inf_{x \in \mathsf{X}} f(x) = f(\widehat{X}_t) - f^*. $$

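Definition A.2. Fix a problem class $\mathcal{P} = (\mathsf{X}, \mathcal{F}, \mathcal{O})$. For any 
$r \ge 1$, $\epsilon > 0$, and $\delta \in (0, 1)$, we define the $r$th-order strong 
infinite-step $(\epsilon, \delta)$-complexity and the $\epsilon$-complexity of $\mathcal{P}$, 
respectively, as 

$$ \tilde{K}^{(r),\infty}_{\mathcal{P}}(\epsilon, \delta) \triangleq \inf\Big\{ T \ge 1 : \exists \tilde{\mathcal{A}} \in \tilde{\mathfrak{A}}_\infty(\mathcal{P}) \text{ s.t. } \sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}^r_{\tilde{\mathcal{A}}}(t, f) \ge \epsilon \big) \le \delta, \ \forall t \ge T \Big\}; $$

$$ \tilde{K}^{(r),\infty}_{\mathcal{P}}(\epsilon) \triangleq \inf\Big\{ T \ge 1 : \exists \tilde{\mathcal{A}} \in \tilde{\mathfrak{A}}_\infty(\mathcal{P}) \text{ s.t. } \sup_{f \in \mathcal{F}} \mathbb{E}\, \mathrm{err}^r_{\tilde{\mathcal{A}}}(t, f) \le \epsilon, \ \forall t \ge T \Big\}. $$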



It turns out that these notions of complexity are equivalent to 
the ones introduced earlier: 

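Proposition A.1. For any problem class $\mathcal{P}$ and all $r \ge 1$, 
$\epsilon > 0$, $\delta \in (0, 1)$, we have 

$$ \tilde{K}^{(r),\infty}_{\mathcal{P}}(\epsilon, \delta) = K^{(r)}_{\mathcal{P}}(\epsilon, \delta), \tag{A.1} $$

$$ \tilde{K}^{(r),\infty}_{\mathcal{P}}(\epsilon) = K^{(r)}_{\mathcal{P}}(\epsilon). \tag{A.2} $$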



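Proof: We only prove (A.1), since the proof of (A.2) is 
similar. Likewise, we will only consider the $r = 1$ case. 

First we prove that $\tilde{K}^{\infty}_{\mathcal{P}}(\epsilon, \delta) \le K_{\mathcal{P}}(\epsilon, \delta)$. We can assume 
that $K_{\mathcal{P}}(\epsilon, \delta) < \infty$, for otherwise the inequality holds a 
fortiori. Given $\epsilon$ and $\delta$, consider any $T$ for which there exists 
some $T$-step algorithm $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$, such that 

$$ \sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}_{\mathcal{A}}(T, f) \ge \epsilon \big) \le \delta. $$

Given $\mathcal{A}$, we can construct a strong infinite-step algorithm 
$\tilde{\mathcal{A}} \in \tilde{\mathfrak{A}}_\infty(\mathcal{P})$ as follows. Choose an arbitrary $T$-tuple 
$(\hat{x}_1, \ldots, \hat{x}_T) \in \mathsf{X}^T$ and let 

$$
\tilde{A}_t(x^{t-1}, y^{t-1}) =
\begin{cases}
\big( A_t(x^{t-1}, y^{t-1}), \hat{x}_t \big), & t = 1, \ldots, T; \\
\big( A_{T+1}(x^T, y^T), A_{T+1}(x^T, y^T) \big), & t > T.
\end{cases}
$$

Then it is clear that for any $t > T$ 

$$
\begin{aligned}
\sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}_{\tilde{\mathcal{A}}}(t, f) \ge \epsilon \big)
&= \sup_{f \in \mathcal{F}} \Pr\big( f(\widehat{X}_t) - f^* \ge \epsilon \big) \\
&= \sup_{f \in \mathcal{F}} \Pr\big( f(A_{T+1}(X^T, Y^T)) - f^* \ge \epsilon \big) \\
&= \sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}_{\mathcal{A}}(T, f) \ge \epsilon \big) \\
&\le \delta.
\end{aligned}
$$

Hence, $\tilde{K}^{\infty}_{\mathcal{P}}(\epsilon, \delta) \le K_{\mathcal{P}}(\epsilon, \delta)$. 

Next, we prove $K_{\mathcal{P}}(\epsilon, \delta) \le \tilde{K}^{\infty}_{\mathcal{P}}(\epsilon, \delta)$. Again, we can 
assume that $\tilde{K}^{\infty}_{\mathcal{P}}(\epsilon, \delta) < \infty$. Consider an algorithm $\tilde{\mathcal{A}} \in 
\tilde{\mathfrak{A}}_\infty(\mathcal{P})$, such that 

$$ \sup_{t \ge T} \sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}_{\tilde{\mathcal{A}}}(t, f) \ge \epsilon \big) \le \delta \tag{A.3} $$

for some $T$. Let $\Pi_1$ and $\Pi_2$ denote the two coordinate 
projection mappings from $\mathsf{X} \times \mathsf{X}$ onto $\mathsf{X}$, i.e., $\Pi_1(x, x') = x$ 
and $\Pi_2(x, x') = x'$, and define $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ by setting 

$$
A_t =
\begin{cases}
\Pi_1 \circ \tilde{A}_t, & t = 1, \ldots, T; \\
\Pi_2 \circ \tilde{A}_t, & t = T + 1.
\end{cases}
$$

Then from (A.3) 

$$
\begin{aligned}
\sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}_{\mathcal{A}}(T, f) \ge \epsilon \big)
&= \sup_{f \in \mathcal{F}} \Pr\big( f(X_{T+1}) - f^* \ge \epsilon \big) \\
&= \sup_{f \in \mathcal{F}} \Pr\big( f(A_{T+1}(X^T, Y^T)) - f^* \ge \epsilon \big) \\
&= \sup_{f \in \mathcal{F}} \Pr\big( \mathrm{err}_{\tilde{\mathcal{A}}}(T + 1, f) \ge \epsilon \big) \\
&\le \delta,
\end{aligned}
$$

which implies $K_{\mathcal{P}}(\epsilon, \delta) \le \tilde{K}^{\infty}_{\mathcal{P}}(\epsilon, \delta)$. ■ 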

Appendix B 
Miscellaneous proofs 

A. Proof of Proposition [7] 

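Given $\epsilon$ and $\delta$, consider any $T$ for which there exists some 
algorithm $\mathcal{A} \in \mathfrak{A}_T(\mathcal{P})$ that satisfies $\sup_{f \in \mathcal{F}} \mathbb{E}\, \mathrm{err}^r_{\mathcal{A}}(T, f) \le \epsilon$. 
Then Markov's inequality gives 

$$ \Pr\big( \mathrm{err}^r_{\mathcal{A}}(T, f) \ge \epsilon/\delta \big) \le \frac{\delta\, \mathbb{E}\, \mathrm{err}^r_{\mathcal{A}}(T, f)}{\epsilon} \le \delta, \quad \forall f \in \mathcal{F}. $$

Hence, $T \ge K^{(r)}_{\mathcal{P}}(\epsilon/\delta, \delta)$. Taking the infimum over all such 
$T$, we obtain the claim. 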

B. Proof of Lemma 3 

First, we modify the construction of the probability space 
$(\Omega, \mathcal{B}, \mathbb{P})$ in Section IV by introducing the random variables 
$U_t \in \mathsf{U}$ that describe the responses of the "clean" (deterministic) 
oracle $\psi : \mathcal{F} \times \mathsf{X} \to \mathsf{U}$ to the queries $X_t$. The relevant 
causal ordering is 

$$ M, X_1, U_1, Y_1, \ldots, X_t, U_t, Y_t, \ldots, X_T, U_T, Y_T, X_{T+1}, $$

where, $\mathbb{P}$-almost surely, we have (13) and 

$$ \mathbb{P}(X_t \in A | M, X^{t-1}, U^{t-1}, Y^{t-1}) = 1_{\{A_t(X^{t-1}, Y^{t-1}) \in A\}}, $$

$$ \mathbb{P}(U_t \in B | M, X^t, U^{t-1}, Y^{t-1}) = 1_{\{\psi(f_M, X_t) \in B\}}, $$

$$ \mathbb{P}(Y_t \in C | M, X^t, U^t, Y^{t-1}) = Q(C | U_t) $$

for all $A \in \mathcal{B}_{\mathsf{X}}$, $B \in \mathcal{B}_{\mathsf{U}}$, $C \in \mathcal{B}_{\mathsf{Y}}$. That is, $(M, U^{t-1}) \to 
(X^{t-1}, Y^{t-1}) \to X_t$, $(X^{t-1}, U^{t-1}, Y^{t-1}) \to (M, X_t) \to U_t$, 
and $(M, X^t, U^{t-1}, Y^{t-1}) \to U_t \to Y_t$ are Markov chains for 
each $t$. Then we can write 

$$ I(M; Y_t | X^t, Y^{t-1}) = I(M, U_t; Y_t | X^t, Y^{t-1}) $$

because $U_t$ is completely determined by $M$ and $X_t$ via $U_t = 
\psi(f_M, X_t)$. Moreover, 

$$
\begin{aligned}
I(M, U_t; Y_t | X^t, Y^{t-1})
&= I(U_t; Y_t | X^t, Y^{t-1}) + I(M; Y_t | U_t, X^t, Y^{t-1}) \\
&= I(U_t; Y_t | X^t, Y^{t-1}),
\end{aligned}
$$

where the first step is by the chain rule and the second step 
is due to the fact that $M \to (U_t, X^t, Y^{t-1}) \to Y_t$ is a 
Markov chain. This follows by applying the weak union and 
the decomposition properties of conditional independence [34, 
p. 11] to the Markov chain $(M, X^t, U^{t-1}, Y^{t-1}) \to U_t \to Y_t$. 
By the same token, $(X^t, Y^{t-1}) \to U_t \to Y_t$ is also a Markov 
chain, so we have $I(U_t; Y_t | X^t, Y^{t-1}) \le I(U_t; Y_t)$. 






C. Proof of Lemma 4 

Let us fix some t and consider the conditional mutual 
information term I(M; Y t \X l , F* -1 ) in the summation in 
Lemma [2j 

/(M^tlX*,^- 1 ) 



D( 

■ E 

E 
D 



itlAf.X'.r'- 1 iFFtlX'.y*- 1 FM,X',5"- 







log- 



dp 



dP 



log- 



dP 



Y t |M,X*,r t - 1 



-E 



log- 



FtlxSy*- 1 



ytlM.x'.y*- 1 ||w t |x*,y*-i |- r Af,x t ,y*- l y 



(B.4) 



(B.5) 



) (B.6) 
(B.7) 



where ( |B.4| i follows from ( |B.5| l and ( |B.6| ) are justified 
by virtue of ( |28) l, while ( |B.7| i follows from the fact that the 
divergence is nonnegative. 

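D. Proof of Lemma 5 

The random variables $V^0 = f_M(X) + W$ and $V^1 = 
g_M(X) + Z$ are conditionally independent given $M = i$ and 
$X = x$: 

$$ P_{Y | M = i, X = x} = P_{V^0 | M = i, X = x} \otimes P_{V^1 | M = i, X = x}, $$

where 

$$ P_{V^0 | M = i, X = x} = N(f_i(x), \sigma^2), \qquad P_{V^1 | M = i, X = x} = N(g_i(x), \sigma^2 I_n). $$

Therefore, 

$$
\begin{aligned}
D\left( P_{Y | M = i, X = x} \,\|\, P_{Y | M = j, X = x} \right)
&= D\left( P_{V^0 | M = i, X = x} \,\|\, P_{V^0 | M = j, X = x} \right) + D\left( P_{V^1 | M = i, X = x} \,\|\, P_{V^1 | M = j, X = x} \right) \\
&= D\left( N(f_i(x), \sigma^2) \,\|\, N(f_j(x), \sigma^2) \right) + D\left( N(g_i(x), \sigma^2 I_n) \,\|\, N(g_j(x), \sigma^2 I_n) \right) \\
&= \frac{1}{2\sigma^2} \left\{ [f_i(x) - f_j(x)]^2 + \| g_i(x) - g_j(x) \|^2 \right\}.
\end{aligned}
$$

Plugging this into (31), we get (36). 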
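
The scalar identity behind the last step, $D(N(\mu_1, \sigma^2) \| N(\mu_2, \sigma^2)) = (\mu_1 - \mu_2)^2 / 2\sigma^2$, can be confirmed by direct numerical integration. The Python sketch below is a sanity check only; the function names, the integration window of twelve standard deviations, and the number of grid points are hypothetical choices, and divergences are in nats.

```python
import math


def gaussian_kl_closed_form(mu1, mu2, sigma):
    """Closed form of D(N(mu1, sigma^2) || N(mu2, sigma^2)) in nats."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)


def gaussian_kl_numeric(mu1, mu2, sigma, half_width=12.0, n=20001):
    """Trapezoidal approximation of the same divergence.

    Integrates p(y) * log(p(y) / q(y)) over mu1 +/- half_width * sigma.
    The log-ratio is written out analytically to avoid taking log of
    underflowed densities in the tails.
    """
    lo = mu1 - half_width * sigma
    h = 2.0 * half_width * sigma / (n - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n):
        y = lo + k * h
        p = math.exp(-((y - mu1) ** 2) / (2.0 * sigma ** 2)) / norm
        log_ratio = ((y - mu2) ** 2 - (y - mu1) ** 2) / (2.0 * sigma ** 2)
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * p * log_ratio * h
    return total


if __name__ == "__main__":
    print(gaussian_kl_closed_form(0.3, -0.5, 0.7))
    print(gaussian_kl_numeric(0.3, -0.5, 0.7))
```

Because the two noise components are independent, the divergence for the full first-order oracle is the sum of one such term for the function value and one for each gradient coordinate, which gives the expression displayed above.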

E. Proof of Lemma 6 

Let $Q^*$ denote the product normal distribution $N(c^*, \sigma^2) \otimes 
N(0, \sigma^2 I_n)$. Observe that, for every $i = 0, 1, \ldots, N-1$, 

$$ Y^{(i)} = (f_i(x_i^*) + W, \nabla f_i(x_i^*) + Z) = (c^* + W, Z) \sim Q^*. $$

Let $X_t$ denote the query of $\mathcal{A}$ at time $t$ and let $Y_t$ be the 
corresponding oracle response. Then 

$$ P_{Y_t | M = i, X_t = x_t} = N(f_i(x_t), \sigma^2) \otimes N(\nabla f_i(x_t), \sigma^2 I_n). $$

Hence, applying the LF bound (32) with $Q_Y = Q^*$, we can 
write 

$$ I(M; Y_t | X^t, Y^{t-1}) \le \frac{1}{2\sigma^2} \max_{i} \mathbb{E}\left\{ [f_i(X_t) - c^*]^2 + \| \nabla f_i(X_t) \|^2 \right\}. \tag{B.8} $$

We now relate the right-hand side of (B.8) to the performance 
of $\mathcal{A}$. First of all, by convexity of $f_i$, 

$$
\begin{aligned}
f_i(X_t) - c^* &= f_i(X_t) - f_i(x_i^*) \\
&\le \nabla f_i(X_t)^{\mathsf{T}} (X_t - x_i^*) \\
&\le \| \nabla f_i(X_t) \| \, \| X_t - x_i^* \| \\
&\le L D_{\mathsf{X}} \| X_t - x_i^* \|,
\end{aligned}
\tag{B.9}
$$

where in the last step we have used the fact that 

$$ \| \nabla f_i(X_t) \| = \| \nabla f_i(X_t) - \nabla f_i(x_i^*) \| \le L \| X_t - x_i^* \| \le L D_{\mathsf{X}}. $$

On the other hand, from strong convexity we have that 

$$ f_i(X_t) \ge c^* + (\kappa^2/2) \| X_t - x_i^* \|^2. \tag{B.10} $$

Combining (B.9) and (B.10), we therefore obtain 

$$
\begin{aligned}
[f_i(X_t) - c^*]^2 &\le 2 D_{\mathsf{X}}^2 (L/\kappa)^2 [f_i(X_t) - c^*] \\
&= 2 D_{\mathsf{X}}^2 (L/\kappa)^2 \, \mathrm{err}_{\mathcal{A}}(t, f_i).
\end{aligned}
\tag{B.11}
$$

Moreover, because $\nabla f_i(x_i^*) = 0$, we can write 

$$
\begin{aligned}
\| \nabla f_i(X_t) \|^2 &= \| \nabla f_i(X_t) - \nabla f_i(x_i^*) \|^2 \\
&\le L^2 \| X_t - x_i^* \|^2 \\
&\le 2 (L/\kappa)^2 \, \mathrm{err}_{\mathcal{A}}(t, f_i).
\end{aligned}
\tag{B.12}
$$

Substituting (B.11) and (B.12) into (B.8), we get (45). Eq. (46) 
is immediate from definitions. 



Appendix C 
Lemma on functional recurrences 

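Lemma C.1. Suppose that $\{\epsilon_t\}$ is a sequence of nonnegative 
reals satisfying 

$$ K \log \frac{1}{\epsilon_T} - L \le \sum_{t=1}^{T} \epsilon_t^{\alpha}, \quad \forall T $$

for some $K, L, \alpha > 0$. Then there exists some constant $0 < 
c < (K/\alpha)^{1/\alpha}$, such that $\epsilon_t \ge c\, t^{-1/\alpha}$ for infinitely many 
values of $t$. 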
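
The recurrence can be explored numerically for polynomially decaying sequences. The Python sketch below is an illustration only and proves nothing; it assumes the specific family $\epsilon_t = t^{-\gamma}$, and the function name `first_violation` and the finite search horizon are hypothetical. It returns the first $T$ at which the hypothesis of the lemma fails, or `None` if no failure is found within the horizon.

```python
import math


def first_violation(gamma, alpha, K, L, horizon=10_000):
    """First T <= horizon with K * log(1/eps_T) - L > sum_{t<=T} eps_t**alpha,
    for the sequence eps_t = t**(-gamma); None if there is no such T.
    """
    running_sum = 0.0
    for T in range(1, horizon + 1):
        running_sum += T ** (-gamma * alpha)
        # log(1 / eps_T) = gamma * log(T) for eps_t = t**(-gamma)
        if K * gamma * math.log(T) - L > running_sum:
            return T
    return None


if __name__ == "__main__":
    # alpha = 1: decay faster than 1/t breaks the recurrence early,
    # while decay at rate 1/t or slower is not ruled out here.
    print(first_violation(gamma=2.0, alpha=1.0, K=1.0, L=0.0))
    print(first_violation(gamma=1.0, alpha=1.0, K=1.0, L=0.0))
    print(first_violation(gamma=0.5, alpha=1.0, K=1.0, L=0.0))
```

With $\alpha = 1$, $K = 1$, $L = 0$, the sequence $t^{-2}$ violates the recurrence almost immediately because the right-hand side stays bounded while the left-hand side grows like $2 \log T$; sequences decaying no faster than $t^{-1/\alpha}$ show no violation within the horizon.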

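Proof: The proof is by contradiction. Suppose first that 
$t^{1/\alpha} \epsilon_t < c$ for all $t$. Then 

$$ K \log \frac{T^{1/\alpha}}{c} - L \le c^{\alpha} \sum_{t=1}^{T} \frac{1}{t} \le c^{\alpha} (\log T + 1), \quad \forall T. $$

Rearranging, we get 

$$ \left( \frac{K}{\alpha} - c^{\alpha} \right) \log T \le K \log c + L + c^{\alpha}, \quad \forall T. $$

Since $K/\alpha - c^{\alpha} > 0$, this implies that $\log T$ is bounded for all 
large and positive $T$, which is impossible. Hence there exists 
some set $S \subseteq \mathbb{N}$, such that $\epsilon_t \ge c\, t^{-1/\alpha}$ for all $t \in S$. We now 
show that $S$ must necessarily be countably infinite. Suppose, 
to the contrary, that it is finite. Then there is some $T_0$, such 
that $\epsilon_t < c\, t^{-1/\alpha}$ for all $t > T_0$. In that case, for $T > T_0$, we 
can write 

$$
\begin{aligned}
K \log \frac{T^{1/\alpha}}{c} - L
&\le c^{\alpha} \left( \sum_{t=1}^{T_0} \frac{\epsilon_t^{\alpha}}{c^{\alpha}} + \sum_{t=T_0+1}^{T} \frac{1}{t} \right) \\
&= c^{\alpha} \left( K(T_0, \alpha) + \sum_{t=T_0+1}^{T} \frac{1}{t} \right) \\
&\le c^{\alpha} \left( K(T_0, \alpha) + \log T - \log T_0 \right),
\end{aligned}
$$

where $K(T_0, \alpha)$ denotes the first sum in parentheses. 
Rearranging, we see that the inequality 

$$ \left( \frac{K}{\alpha} - c^{\alpha} \right) \log T \le K \log c + L + c^{\alpha} \left( K(T_0, \alpha) - \log T_0 \right) $$

must hold for all $T > T_0$. Since $K/\alpha > c^{\alpha}$ by hypothesis, 
this implies that $\log T$ is bounded for $T > T_0$, which is, again, 
impossible. Thus, $\epsilon_t \ge c\, t^{-1/\alpha}$ for infinitely many values of $t$. ■ 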